Chapter 1

Basic Probability Theory 


In this chapter we introduce the mathematical framework of probability theory, which makes it 
possible to reason about uncertainty in a principled way using set theory. Appendix A contains 
a review of basic set-theory concepts. 

1.1 Probability spaces 

Our goal is to build a mathematical framework to represent and analyze uncertain phenomena,
such as the result of rolling a die, tomorrow’s weather, or the result of an NBA game. To this
end we model the phenomenon of interest as an experiment with a set of (possibly infinitely
many) mutually exclusive outcomes.

Except in simple cases, when the number of outcomes is small, it is customary to reason about 
sets of outcomes, called events. To quantify how likely it is for the outcome of the experiment 
to belong to a specific event, we assign a probability to the event. To do so, we define
a measure (recall that a measure is a function that maps sets to real numbers) that assigns
a probability to each event of interest.

More formally, the experiment is characterized by constructing a probability space.

Definition 1.1.1 (Probability space). A probability space is a triple $(\Omega, \mathcal{F}, P)$ consisting of:

• A sample space $\Omega$, which contains all possible outcomes of the experiment.

• A set of events $\mathcal{F}$, which must be a $\sigma$-algebra (see Definition 1.1.2 below).

• A probability measure $P$ that assigns probabilities to the events in $\mathcal{F}$ (see Definition 1.1.4
below).

Sample spaces may be discrete or continuous. Examples of discrete sample spaces include the 
possible outcomes of a coin toss, the score of a basketball game, the number of people that show 
up at a party, etc. Continuous sample spaces are usually intervals of $\mathbb{R}$ or $\mathbb{R}^n$ used to model
time, position, temperature, etc. 

The term $\sigma$-algebra is used in measure theory to denote a collection of sets that satisfy certain
conditions listed below. Don’t be too intimidated by it. It is just a sophisticated way of stating
that if we assign a probability to certain events (for example, it will rain tomorrow or it will
snow tomorrow) we also need to assign a probability to their complements (i.e. it will not rain
tomorrow or it will not snow tomorrow) and to their union (it will rain or snow tomorrow).

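Definition 1.1.2 ($\sigma$-algebra). A $\sigma$-algebra $\mathcal{F}$ is a collection of sets in $\Omega$ such that:

1. If a set $S \in \mathcal{F}$ then $S^c \in \mathcal{F}$.

2. If the sets $S_1, S_2 \in \mathcal{F}$, then $S_1 \cup S_2 \in \mathcal{F}$. This also holds for infinite sequences; if
$S_1, S_2, \ldots \in \mathcal{F}$ then $\cup_{i=1}^{\infty} S_i \in \mathcal{F}$.

3. $\Omega \in \mathcal{F}$.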


If our sample space is discrete, a possible choice for the $\sigma$-algebra is the power set of the sample
space, which consists of all possible sets of elements in the sample space. If we are tossing a coin
and the sample space is

\begin{equation}
\Omega := \{\text{heads}, \text{tails}\}, \tag{1.1}
\end{equation}

then the power set is a valid $\sigma$-algebra

\begin{equation}
\mathcal{F} := \{\text{heads or tails}, \text{heads}, \text{tails}, \emptyset\}, \tag{1.2}
\end{equation}

where $\emptyset$ denotes the empty set. However, in many cases $\sigma$-algebras do not contain every possible
set of outcomes.

Example 1.1.3 (Cholesterol). A doctor is interested in modeling the cholesterol levels of her 
patients probabilistically. Every time a patient visits her, she tests their cholesterol level. Here 
the experiment is the cholesterol test, the outcome is the measured cholesterol level, and the 
sample space $\Omega$ is the positive real line. The doctor is mainly interested in whether the patients
have low, borderline-high, or high cholesterol. The event $L$ (low cholesterol) contains all
outcomes below 200 mg/dL, the event $B$ (borderline-high cholesterol) contains all outcomes
between 200 and 240 mg/dL, and the event $H$ (high cholesterol) contains all outcomes above
240 mg/dL. The $\sigma$-algebra $\mathcal{F}$ of possible events therefore equals

\begin{equation}
\mathcal{F} := \{L \cup B \cup H, L \cup B, L \cup H, B \cup H, L, B, H, \emptyset\}. \tag{1.3}
\end{equation}

The events $L$, $B$ and $H$ are a partition of the sample space, which simplifies deriving the
corresponding $\sigma$-algebra. △
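
To make the construction concrete, the following Python sketch builds the collection of all unions of the blocks of a finite partition and checks the three conditions of Definition 1.1.2 on it. It is only an illustration under simplifying assumptions: each block (L, B, H) is represented by a single label rather than by an interval of the real line, and the function names are hypothetical.

```python
from itertools import combinations


def sigma_algebra_from_partition(blocks):
    """Return every union of partition blocks, including the empty union."""
    events = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            events.add(frozenset().union(*chosen))
    return events


def is_sigma_algebra(events, omega):
    """Check the conditions of Definition 1.1.2 on a finite collection of sets."""
    has_omega = omega in events
    closed_complement = all(omega - s in events for s in events)
    closed_union = all(a | b in events for a in events for b in events)
    return has_omega and closed_complement and closed_union


# Assumption: one label per block stands in for the corresponding interval.
L, B, H = frozenset({"L"}), frozenset({"B"}), frozenset({"H"})
omega = L | B | H
events = sigma_algebra_from_partition([L, B, H])

print(len(events))                      # 8 events, as in (1.3)
print(is_sigma_algebra(events, omega))  # True
```

A collection such as {∅, L, Ω} fails the check, because the complement of L is missing.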


The role of the probability measure P is to quantify how likely we are to encounter each of the 
events in the $\sigma$-algebra. Intuitively, the probability of an event $A$ can be interpreted as the
fraction of times that the outcome of the experiment is in A, as the number of repetitions tends 
to infinity. It follows that probabilities should always be nonnegative. Also, if two events A and 
B are disjoint (their intersection is empty), then 


\begin{align}
P(A \cup B) &= \frac{\text{outcomes in } A \text{ or } B}{\text{total}} \tag{1.4} \\
&= \frac{\text{outcomes in } A + \text{outcomes in } B}{\text{total}} \tag{1.5} \\
&= \frac{\text{outcomes in } A}{\text{total}} + \frac{\text{outcomes in } B}{\text{total}} \tag{1.6} \\
&= P(A) + P(B). \tag{1.7}
\end{align}

Probabilities of unions of disjoint events should equal the sum of the individual probabilities.
Additionally, the probability of the whole sample space $\Omega$ should equal one, as it contains all
outcomes

\begin{align}
P(\Omega) &= \frac{\text{outcomes in } \Omega}{\text{total}} \tag{1.8} \\
&= \frac{\text{total}}{\text{total}} \tag{1.9} \\
&= 1. \tag{1.10}
\end{align}


These conditions are necessary for a measure to be a valid probability measure. 

Definition 1.1.4 (Probability measure). A probability measure is a function defined over the
sets in a $\sigma$-algebra $\mathcal{F}$ such that:

1. $P(S) \geq 0$ for any event $S \in \mathcal{F}$.

2. If the sets $S_1, S_2, \ldots, S_n \in \mathcal{F}$ are disjoint (i.e. $S_i \cap S_j = \emptyset$ for $i \neq j$) then

\begin{equation}
P\left(\cup_{i=1}^{n} S_i\right) = \sum_{i=1}^{n} P(S_i). \tag{1.11}
\end{equation}

Similarly, for a countably infinite sequence of disjoint sets $S_1, S_2, \ldots \in \mathcal{F}$

\begin{equation}
P\left(\lim_{n \to \infty} \cup_{i=1}^{n} S_i\right) = \lim_{n \to \infty} \sum_{i=1}^{n} P(S_i). \tag{1.12}
\end{equation}

3. $P(\Omega) = 1$.

The first two axioms capture the intuitive idea that the probability of an event is a measure
such as mass (or length or volume): just like the mass of any object is nonnegative and the
total mass of several distinct objects is the sum of their masses, the probability of any event
is nonnegative and the probability of the union of several disjoint events is the sum of their
probabilities. However, in contrast to mass, the amount of probability in an experiment cannot
be unbounded. If it is highly likely that it will rain tomorrow, then it cannot also be very
likely that it will not rain. If the probability of an event $S$ is large, then the probability of
its complement $S^c$ must be small. This is captured by the third axiom, which normalizes the
probability measure (and implies that $P(S^c) = 1 - P(S)$).

It is important to stress that the probability measure does not assign probabilities to individual
outcomes, but rather to events in the $\sigma$-algebra. The reason for this is that when the number
of possible outcomes is uncountably infinite, then one cannot assign nonzero probability to all
the outcomes and still satisfy the condition $P(\Omega) = 1$. This is not an exotic situation; it occurs
for instance in the cholesterol example, where any positive real number is a possible outcome.
In the case of discrete or countable sample spaces, the $\sigma$-algebra may equal the power set of the
sample space, which means that we do assign probabilities to events that only contain a single
outcome (e.g. the coin-toss example).
Example 1.1.5 (Cholesterol (continued)). A valid probability measure for Example 1.1.3 is

\begin{equation}
P(L) = 0.6, \quad P(B) = 0.28, \quad P(H) = 0.12. \tag{1.13}
\end{equation}

Using the properties, we can determine for instance that $P(L \cup B) = 0.6 + 0.28 = 0.88$. △

Definition 1.1.4 has the following consequences: 

\begin{gather}
P(\emptyset) = 0, \tag{1.14} \\
A \subseteq B \text{ implies } P(A) \leq P(B), \tag{1.15} \\
P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{1.16}
\end{gather}


We omit the proofs (try proving them on your own). 
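
While the proofs are left as an exercise, the three consequences can at least be checked numerically on a small example. The sketch below uses the probabilities of the partition events L, B and H from Example 1.1.5, extends them to every union by additivity, and verifies the axioms together with (1.14) to (1.16). Representing events as sets of labels and the helper name P are assumptions made for illustration; exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations

# Probabilities of the partition events from Example 1.1.5.
atom_prob = {
    "L": Fraction(60, 100),
    "B": Fraction(28, 100),
    "H": Fraction(12, 100),
}
omega = frozenset(atom_prob)


def P(event):
    """Probability of a union of partition events, obtained by additivity."""
    return sum((atom_prob[a] for a in event), Fraction(0))


# All 8 events of the sigma-algebra generated by the partition.
events = [frozenset(c) for r in range(4) for c in combinations(sorted(omega), r)]

assert all(P(e) >= 0 for e in events)  # axiom 1
assert P(omega) == 1                   # axiom 3
assert P(frozenset()) == 0             # consequence (1.14)
for a in events:
    for b in events:
        if a <= b:
            assert P(a) <= P(b)                    # consequence (1.15)
        assert P(a | b) == P(a) + P(b) - P(a & b)  # consequence (1.16)

print(P(frozenset({"L", "B"})))  # 22/25, i.e. 0.88
```

A check of this kind does not replace a proof, but it is a quick way to catch an inconsistent assignment of probabilities.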


1.2 Conditional probability 

Conditional probability is a crucial concept in probabilistic modeling. It allows us to update
probabilistic models when additional information is revealed. Consider a probability space
$(\Omega, \mathcal{F}, P)$ where we find out that the outcome of the experiment belongs to a certain event
$S \in \mathcal{F}$. This obviously affects how likely it is for any other event $S' \in \mathcal{F}$ to have occurred: we
can rule out any outcome not belonging to $S$. The updated probability of each event is known
as the conditional probability of $S'$ given $S$. Intuitively, the conditional probability can be
interpreted as the fraction of outcomes in $S$ that are also in $S'$,


\begin{align}
P(S'|S) &= \frac{\text{outcomes in } S' \text{ and } S}{\text{outcomes in } S} \tag{1.17} \\
&= \frac{\text{outcomes in } S' \text{ and } S}{\text{total}} \cdot \frac{\text{total}}{\text{outcomes in } S} \tag{1.18} \\
&= \frac{P(S' \cap S)}{P(S)}, \tag{1.19}
\end{align}

where we assume that $P(S) \neq 0$ (later on we will have to deal with the case when $S$ has
zero probability, which often occurs in continuous probability spaces). The definition is rather
intuitive: $S$ is now the new sample space, so if the outcome is in $S'$ then it must belong to
$S' \cap S$. However, just using the probability of the intersection would underestimate how likely it
is for $S'$ to occur because the sample space has been reduced to $S$. Therefore we normalize by
the probability of $S$. As a sanity check, we have $P(S|S) = 1$ and if $S$ and $S'$ are disjoint then
$P(S'|S) = 0$.

The conditional probability $P(\cdot|S)$ is a valid probability measure in the probability space
$(S, \mathcal{F}_S, P(\cdot|S))$, where $\mathcal{F}_S$ is a $\sigma$-algebra that contains the intersection of $S$ and the sets in
$\mathcal{F}$. To simplify notation, when we condition on an intersection of sets we write the conditional
probability as

\begin{equation}
P(S|A, B, C) := P(S|A \cap B \cap C), \tag{1.20}
\end{equation}


for any events S, A, B, C. 
Example 1.2.1 (Flights and rain). JFK airport hires you to estimate how the punctuality of
flight arrivals is affected by the weather. You begin by defining a probability space for which
the sample space is

\begin{equation}
\Omega = \{\text{late and rain}, \text{late and no rain}, \text{on time and rain}, \text{on time and no rain}\} \tag{1.21}
\end{equation}

and the $\sigma$-algebra is the power set of $\Omega$. From data of past flights you determine that a reasonable
estimate for the probability measure of the probability space is


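\begin{align}
P(\text{late}, \text{no rain}) &= \frac{2}{20}, & P(\text{on time}, \text{no rain}) &= \frac{14}{20}, \tag{1.22} \\
P(\text{late}, \text{rain}) &= \frac{3}{20}, & P(\text{on time}, \text{rain}) &= \frac{1}{20}. \tag{1.23}
\end{align}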


The airport is interested in the probability of a flight being late if it rains, so you define a new
probability space conditioning on the event rain. The sample space is the set of all outcomes
such that rain occurred, the $\sigma$-algebra is the power set of {on time, late} and the probability
measure is $P(\cdot|\text{rain})$. In particular,

\begin{equation}
P(\text{late}|\text{rain}) = \frac{P(\text{late}, \text{rain})}{P(\text{rain})} = \frac{3/20}{3/20 + 1/20} = \frac{3}{4} \tag{1.24}
\end{equation}

and similarly $P(\text{late}|\text{no rain}) = 1/8$. △
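
The computation in this example can be reproduced with a short Python sketch that stores the joint probabilities in a dictionary and applies the definition $P(S'|S) = P(S' \cap S) / P(S)$ directly. The joint values used are the ones consistent with (1.24), with $P(\text{late}|\text{no rain}) = 1/8$ and with the requirement that the four probabilities add up to one; the helper names prob and cond_prob are hypothetical.

```python
from fractions import Fraction

# Joint probabilities of Example 1.2.1, keyed by (arrival, weather).
joint = {
    ("late", "rain"): Fraction(3, 20),
    ("late", "no rain"): Fraction(2, 20),
    ("on time", "rain"): Fraction(1, 20),
    ("on time", "no rain"): Fraction(14, 20),
}


def prob(event):
    """P(event), where event is a predicate over outcomes."""
    return sum((p for o, p in joint.items() if event(o)), Fraction(0))


def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given), for P(given) > 0."""
    p_given = prob(given)
    if p_given == 0:
        raise ValueError("conditioning event has zero probability")
    return prob(lambda o: event(o) and given(o)) / p_given


def late(o):
    return o[0] == "late"


def rain(o):
    return o[1] == "rain"


def no_rain(o):
    return o[1] == "no rain"


print(cond_prob(late, rain))     # 3/4
print(cond_prob(late, no_rain))  # 1/8
```

Events are encoded as predicates so that intersections reduce to a logical "and"; for a larger sample space, explicit sets of outcomes would work equally well.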


Conditional probabilities can be used to compute the probability of the intersection of several
events in a structured way. By definition, we can express the probability of the intersection of
two events $A, B \in \mathcal{F}$ as follows,

\begin{align}
P(A \cap B) &= P(A) P(B|A) \tag{1.25} \\
&= P(B) P(A|B). \tag{1.26}
\end{align}

In this formula $P(A)$ is known as the prior probability of $A$, as it captures the information we
have about $A$ before anything else is revealed. Analogously, $P(A|B)$ is known as the posterior
probability. These are fundamental quantities in Bayesian models, discussed in Chapter 10.
Generalizing (1.25) to a sequence of events gives the chain rule, which allows us to express the
probability of the intersection of multiple events in terms of conditional probabilities. We omit
the proof, which is a straightforward application of induction.

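Theorem 1.2.2 (Chain rule). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $S_1, S_2, \ldots$ a collection of
events in $\mathcal{F}$,

\begin{align}
P\left(\cap_i S_i\right) &= P(S_1) P(S_2|S_1) P(S_3|S_1 \cap S_2) \cdots \tag{1.27} \\
&= \prod_i P\left(S_i \,\middle|\, \cap_{j=1}^{i-1} S_j\right). \tag{1.28}
\end{align}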

Sometimes, estimating the probability of a certain event directly may be more challenging than
estimating its probability conditioned on simpler events. A collection of disjoint sets $A_1, A_2, \ldots$
such that $\Omega = \cup_i A_i$ is called a partition of $\Omega$. The law of total probability allows us to pool
conditional probabilities together, weighting them by the probability of the individual events in
the partition, to compute the probability of the event of interest.
Theorem 1.2.3 (Law of total probability). Let $(\Omega, \mathcal{F}, P)$ be a probability space and let the
collection of disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$ be any partition of $\Omega$. For any set $S \in \mathcal{F}$

\begin{align}
P(S) &= \sum_i P(S \cap A_i) \tag{1.29} \\
&= \sum_i P(A_i) P(S|A_i). \tag{1.30}
\end{align}

Proof. This is an immediate consequence of the chain rule and Axiom 2 in Definition 1.1.4, since
$S = \cup_i (S \cap A_i)$ and the sets $S \cap A_i$ are disjoint. □

Example 1.2.4 (Aunt visit). Your aunt is arriving at JFK tomorrow and you would like to 
know how likely it is for her flight to be on time. From Example 1.2.1, you recall that 

P (late|rain) = 0.75, P (late|no rain) = 0.125. (1.31) 

After checking out a weather website, you determine that P (rain) = 0.2. 

Now, how can we integrate all of this information? The events rain and no rain are disjoint and 
cover the whole sample space, so they form a partition. We can consequently apply the law of 
total probability to determine 

\begin{align}
P(\text{late}) &= P(\text{late}|\text{rain}) P(\text{rain}) + P(\text{late}|\text{no rain}) P(\text{no rain}) \tag{1.32} \\
&= 0.75 \cdot 0.2 + 0.125 \cdot 0.8 = 0.25. \tag{1.33}
\end{align}

So the probability that your aunt’s plane is late is 1/4. △


It is crucial to realize that in general $P(A|B) \neq P(B|A)$: most players in the NBA probably
own a basketball ($P(\text{owns ball}|\text{NBA})$ is large) but most people that own basketballs are not in
the NBA ($P(\text{NBA}|\text{owns ball})$ is small). The reason is that the prior probabilities are very different:
$P(\text{NBA})$ is much smaller than $P(\text{owns ball})$. However, it is possible to invert conditional
probabilities, i.e. find $P(A|B)$ from $P(B|A)$, as long as we take into account the priors. This
straightforward consequence of the definition of conditional probability is known as Bayes’ rule.

Theorem 1.2.5 (Bayes’ rule). For any events $A$ and $B$ in a probability space $(\Omega, \mathcal{F}, P)$

\begin{equation}
P(A|B) = \frac{P(A) P(B|A)}{P(B)} \tag{1.34}
\end{equation}

as long as $P(B) > 0$.

Example 1.2.6 (Aunt visit (continued)). You explain the probabilistic model described in 
Example 1.2.4 to your cousin Marvin who lives in California. A day later, you tell him that your 
aunt arrived late but you don’t mention whether it rained or not. After he hangs up, Marvin 
wants to figure out the probability that it rained. Recall that the probability of rain was 0.2, 
but since your aunt arrived late he should update the estimate. Applying Bayes’ rule and the
law of total probability:

\begin{align}
P(\text{rain}|\text{late}) &= \frac{P(\text{late}|\text{rain}) P(\text{rain})}{P(\text{late})} \tag{1.35} \\
&= \frac{P(\text{late}|\text{rain}) P(\text{rain})}{P(\text{late}|\text{rain}) P(\text{rain}) + P(\text{late}|\text{no rain}) P(\text{no rain})} \tag{1.36} \\
&= \frac{0.75 \cdot 0.2}{0.75 \cdot 0.2 + 0.125 \cdot 0.8} = 0.6. \tag{1.37}
\end{align}

As expected, the probability that it rained increases under the assumption that your aunt is
late. △
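
The two steps used in Examples 1.2.4 and 1.2.6, the law of total probability followed by Bayes' rule, can be written as two small functions. The following is a minimal sketch; the function names and the dictionary-based representation of the partition are assumptions chosen for illustration.

```python
def total_probability(likelihoods, priors):
    """P(S) = sum_i P(S | A_i) P(A_i) for a partition A_1, A_2, ..."""
    if abs(sum(priors.values()) - 1.0) > 1e-12:
        raise ValueError("the priors over a partition must add up to one")
    return sum(likelihoods[a] * priors[a] for a in priors)


def bayes_posterior(likelihoods, priors):
    """P(A_i | S) for every event A_i of the partition, via Bayes' rule."""
    p_s = total_probability(likelihoods, priors)
    return {a: likelihoods[a] * priors[a] / p_s for a in priors}


# Values from Examples 1.2.4 and 1.2.6.
p_late_given = {"rain": 0.75, "no rain": 0.125}
p_weather = {"rain": 0.2, "no rain": 0.8}

p_late = total_probability(p_late_given, p_weather)
posterior = bayes_posterior(p_late_given, p_weather)

print(round(p_late, 10))             # 0.25
print(round(posterior["rain"], 10))  # 0.6
```

Since the denominator is the same for every event in the partition, the posterior probabilities automatically add up to one.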


1.3 Independence 

As discussed in the previous section, conditional probabilities quantify the extent to which the 
knowledge of the occurrence of a certain event affects the probability of another event. In some 
cases, it makes no difference: the events are independent. More formally, events A and B are 
independent if and only if 


\begin{equation}
P(A|B) = P(A). \tag{1.38}
\end{equation}

This definition is not valid if $P(B) = 0$. The following definition covers this case and is otherwise
equivalent.

Definition 1.3.1 (Independence). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Two events $A, B \in \mathcal{F}$
are independent if and only if

\begin{equation}
P(A \cap B) = P(A) P(B). \tag{1.39}
\end{equation}


Example 1.3.2 (Congress). We consider a data set compiling the votes of members of the
U.S. House of Representatives on two issues in 1984.¹ The issues are cost sharing for a water
project (issue 1) and adoption of the budget resolution (issue 2). We model the behavior of
the congressmen probabilistically, defining a sample space where each outcome is a sequence of
votes. For instance, a possible outcome is issue 1 = yes, issue 2 = no. We choose the $\sigma$-algebra
to be the power set of the sample space. To estimate the probability measure associated with
different events, we just compute the fraction of their occurrence in the data.

\begin{align}
P(\text{issue 1} = \text{yes}) &\approx \frac{\text{members voting yes on issue 1}}{\text{total votes on issue 1}} \tag{1.40} \\
&= 0.597, \tag{1.41} \\
P(\text{issue 2} = \text{yes}) &\approx \frac{\text{members voting yes on issue 2}}{\text{total votes on issue 2}} \tag{1.42} \\
&= 0.417, \tag{1.43} \\
P(\text{issue 1} = \text{yes} \cap \text{issue 2} = \text{yes}) &\approx \frac{\text{members voting yes on issues 1 and 2}}{\text{total members voting on issues 1 and 2}} \tag{1.44} \\
&= 0.069. \tag{1.45}
\end{align}

¹ The data is available here.
Based on these data, we can evaluate whether voting behavior on the two issues was dependent.
In other words, if we know how a member voted on issue 1, does this provide information about
how they voted on issue 2? The answer is yes, since

\begin{equation}
P(\text{issue 1} = \text{yes}) P(\text{issue 2} = \text{yes}) = 0.249 \tag{1.46}
\end{equation}

is very different from $P(\text{issue 1} = \text{yes} \cap \text{issue 2} = \text{yes})$. If a member voted yes on issue 1, they
were less likely to vote yes on issue 2. △
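
The comparison in this example amounts to a few lines of arithmetic. The sketch below recomputes the product in (1.46) from the estimates (1.41), (1.43) and (1.45), and also evaluates the conditional probability of a yes vote on issue 2 given a yes vote on issue 1, which is what the final sentence of the example refers to. The function name independence_gap is hypothetical.

```python
def independence_gap(p_a, p_b, p_ab):
    """Return P(A) P(B) - P(A and B), which is zero when A and B are independent."""
    return p_a * p_b - p_ab


# Estimates from Example 1.3.2.
p_issue1, p_issue2, p_both = 0.597, 0.417, 0.069

print(round(p_issue1 * p_issue2, 3))                           # 0.249
print(round(independence_gap(p_issue1, p_issue2, p_both), 3))  # 0.18
print(round(p_both / p_issue1, 3))                             # 0.116
```

The last number is an estimate of $P(\text{issue 2} = \text{yes}|\text{issue 1} = \text{yes})$, far below the unconditional 0.417. With estimated probabilities the gap is never exactly zero, so in practice one has to judge whether it is small relative to the estimation error.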


Similarly, we can define conditional independence between two events given a third event.
$A$ and $B$ are conditionally independent given $C$ if and only if

\begin{equation}
P(A|B, C) = P(A|C), \tag{1.47}
\end{equation}

where $P(A|B, C) := P(A|B \cap C)$. Intuitively, this means that the probability of $A$ is not affected
by whether $B$ occurs or not, as long as $C$ occurs.

Definition 1.3.3 (Conditional independence). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Two events
$A, B \in \mathcal{F}$ are conditionally independent given a third event $C \in \mathcal{F}$ if and only if

\begin{equation}
P(A \cap B|C) = P(A|C) P(B|C). \tag{1.48}
\end{equation}


Example 1.3.4 (Congress (continued)). The main factor that determines how members of 
congress vote is political affiliation. We therefore incorporate it into the probabilistic model in 
Example 1.3.2. Each outcome now consists of the votes for issues 1 and 2, and also the affiliation 
of the member, e.g. issue 1 = yes, issue 2 = no, affiliation = republican, or issue 1 = no, issue 
2 = no, affiliation = democrat. The $\sigma$-algebra is the power set of the sample space. We again
estimate the values of the probability measure associated with different events using the data:


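\begin{align}
P(\text{issue 1} = \text{yes} \,|\, \text{republican}) &= \frac{\text{republicans voting yes on issue 1}}{\text{total republican votes on issue 1}} \tag{1.49} \\
&= 0.134, \tag{1.50} \\
P(\text{issue 2} = \text{yes} \,|\, \text{republican}) &= \frac{\text{republicans voting yes on issue 2}}{\text{total republican votes on issue 2}} \tag{1.51} \\
&= 0.988, \tag{1.52} \\
P(\text{issue 1} = \text{yes} \cap \text{issue 2} = \text{yes} \,|\, \text{republican}) &= \frac{\text{republicans voting yes on issues 1 and 2}}{\text{republicans voting on both issues}} \\
&= 0.134. \tag{1.53}
\end{align}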


Based on these data, we can evaluate whether voting behavior on the two issues was dependent 
conditioned on the member being a republican. In other words, if we know how a member voted 
on issue 1 and that they are a republican, does this provide information about how they voted 
on issue 2? The answer is no, since 


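\begin{equation}
P(\text{issue 1} = \text{yes} \,|\, \text{republican}) P(\text{issue 2} = \text{yes} \,|\, \text{republican}) = 0.133 \tag{1.54}
\end{equation}

is very close to $P(\text{issue 1} = \text{yes} \cap \text{issue 2} = \text{yes} \,|\, \text{republican})$. The votes are approximately
independent given the knowledge that the member is a republican. △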
As suggested by Examples 1.3.2 and 1.3.4, independence does not imply conditional independence
or vice versa. This is further illustrated by the following examples. From now on, to
simplify notation, we write the probability of the intersection of several events in the following
form

\begin{equation}
P(A, B, C) := P(A \cap B \cap C). \tag{1.55}
\end{equation}

Example 1.3.5 (Conditional independence does not imply independence). Your cousin Marvin
from Example 1.2.6 always complains about taxis in New York. From his many visits to JFK he
has calculated that

\begin{equation}
P(\text{taxi}|\text{rain}) = 0.1, \quad P(\text{taxi}|\text{no rain}) = 0.6, \tag{1.56}
\end{equation}

where taxi denotes the event of finding a free taxi after picking up your luggage. Given the
events rain and no rain, it is reasonable to model the events plane arrived late and taxi as
conditionally independent,

\begin{align}
P(\text{taxi}, \text{late}|\text{rain}) &= P(\text{taxi}|\text{rain}) P(\text{late}|\text{rain}), \tag{1.57} \\
P(\text{taxi}, \text{late}|\text{no rain}) &= P(\text{taxi}|\text{no rain}) P(\text{late}|\text{no rain}). \tag{1.58}
\end{align}

The logic behind this is that the availability of taxis after picking up your luggage depends 
on whether it’s raining or not, but not on whether the plane is late or not (we assume that 
availability is constant throughout the day). Does this assumption imply that the events are 
independent? 

If they were independent, then knowing that your aunt was late would give no information to 
Marvin about taxi availability. However, 


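\begin{align}
P(\text{taxi}) &= P(\text{taxi}, \text{rain}) + P(\text{taxi}, \text{no rain}) \quad \text{(by the law of total probability)} \tag{1.59} \\
&= P(\text{taxi}|\text{rain}) P(\text{rain}) + P(\text{taxi}|\text{no rain}) P(\text{no rain}) \tag{1.60} \\
&= 0.1 \cdot 0.2 + 0.6 \cdot 0.8 = 0.5, \tag{1.61}
\end{align}

\begin{align}
P(\text{taxi}|\text{late}) &= \frac{P(\text{taxi}, \text{late}, \text{rain}) + P(\text{taxi}, \text{late}, \text{no rain})}{P(\text{late})} \quad \text{(by the law of total probability)} \\
&= \frac{P(\text{taxi}|\text{rain}) P(\text{late}|\text{rain}) P(\text{rain}) + P(\text{taxi}|\text{no rain}) P(\text{late}|\text{no rain}) P(\text{no rain})}{P(\text{late})} \\
&= \frac{0.1 \cdot 0.75 \cdot 0.2 + 0.6 \cdot 0.125 \cdot 0.8}{0.25} = 0.3. \tag{1.62}
\end{align}

$P(\text{taxi}) \neq P(\text{taxi}|\text{late})$ so the events are not independent. This makes sense, since if the airplane
is late, it is more probable that it is raining, which makes taxis more difficult to find. △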
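
The calculation in (1.59) to (1.62) can be retraced numerically. The sketch below builds the joint probability of taxi, late and each weather condition under the conditional independence assumption (1.57) and (1.58), and then compares $P(\text{taxi})$ with $P(\text{taxi}|\text{late})$. The variable names are hypothetical; the numbers are those of Examples 1.2.4 and 1.3.5.

```python
p_weather = {"rain": 0.2, "no rain": 0.8}
p_taxi_given = {"rain": 0.1, "no rain": 0.6}
p_late_given = {"rain": 0.75, "no rain": 0.125}

# P(taxi, late, w), using conditional independence of taxi and late given w.
p_taxi_late_w = {
    w: p_taxi_given[w] * p_late_given[w] * p_weather[w] for w in p_weather
}

# Law of total probability over the partition {rain, no rain}.
p_taxi = sum(p_taxi_given[w] * p_weather[w] for w in p_weather)
p_late = sum(p_late_given[w] * p_weather[w] for w in p_weather)

# Definition of conditional probability.
p_taxi_given_late = sum(p_taxi_late_w.values()) / p_late

print(round(p_taxi, 10))             # 0.5
print(round(p_taxi_given_late, 10))  # 0.3
```

The two values differ, so taxi and late are dependent even though they are conditionally independent given the weather.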


Example 1.3.6 (Independence does not imply conditional independence). After looking at your 
probabilistic model from Example 1.2.1 your contact at JFK points out that delays are often 
caused by mechanical problems in the airplanes. You look at the data and determine that 


P (problem) = P (problem|rain) = P (problem|no rain) = 0.1, 


(1.63) 
so the events mechanical problem and rain in NYC are independent, which makes intuitive
sense. After some more analysis of the data, you estimate

\begin{equation*}
P(\text{late}|\text{problem}) = 0.7, \quad P(\text{late}|\text{no problem}) = 0.2, \quad P(\text{late}|\text{no rain}, \text{problem}) = 0.5.
\end{equation*}


The next time you are waiting for Marvin at JFK, you start wondering about the probability of
his plane having had some mechanical problem. Without any further information, this probability
is 0.1. It is a sunny day in New York, but this is of no help because according to the data
(and common sense) the events problem and rain are independent.

Suddenly they announce that Marvin’s plane is late. Now, what is the probability that his 
plane had a mechanical problem? At first thought you might apply Bayes’ rule to compute 
P (problem|late) = 0.28 as in Example 1.2.6. However, you are not using the fact that it is 
sunny. This means that the rain was not responsible for the delay, so intuitively a mechanical 
problem should be more likely. Indeed, 


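\begin{align}
P(\text{problem}|\text{late}, \text{no rain}) &= \frac{P(\text{late}, \text{no rain}, \text{problem})}{P(\text{late}, \text{no rain})} \tag{1.64} \\
&= \frac{P(\text{late}|\text{no rain}, \text{problem}) P(\text{no rain}) P(\text{problem})}{P(\text{late}|\text{no rain}) P(\text{no rain})} \quad \text{(by the Chain Rule)} \\
&= \frac{0.5 \cdot 0.1}{0.125} = 0.4. \tag{1.65}
\end{align}

Since $P(\text{problem}|\text{late}, \text{no rain}) \neq P(\text{problem}|\text{late})$ the events mechanical problem and rain in
NYC are not conditionally independent given the event plane is late. △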
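
Both posterior probabilities in this example follow from a few multiplications, which the sketch below spells out. It uses the value $P(\text{late}|\text{no rain}) = 0.125$ from Example 1.2.1 and the fact that problem and rain are independent, so that $P(\text{no rain})$ cancels in the second computation. The variable names are hypothetical.

```python
p_problem = 0.1
p_late_given_problem = 0.7
p_late_given_no_problem = 0.2
p_late_given_no_rain = 0.125          # from Example 1.2.1
p_late_given_no_rain_problem = 0.5

# Law of total probability over {problem, no problem}.
p_late = (p_late_given_problem * p_problem
          + p_late_given_no_problem * (1 - p_problem))

# Bayes' rule without using the weather.
p_problem_given_late = p_late_given_problem * p_problem / p_late

# Conditioning on no rain as well; P(no rain) cancels because
# problem and rain are independent.
p_problem_given_late_no_rain = (
    p_late_given_no_rain_problem * p_problem / p_late_given_no_rain
)

print(round(p_late, 10))                        # 0.25
print(round(p_problem_given_late, 10))          # 0.28
print(round(p_problem_given_late_no_rain, 10))  # 0.4
```

Note that the value of $P(\text{late})$ obtained from the mechanical-problem data agrees with the 0.25 computed from the weather data in Example 1.2.4.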





Chapter 2 

Random Variables 


Random variables are a fundamental tool in probabilistic modeling. They allow us to model 
numerical quantities that are uncertain: the temperature in New York tomorrow, the time of 
arrival of a flight, the position of a satellite... Reasoning about such quantities probabilistically 
allows us to structure the information we have about them in a principled way. 

2.1 Definition 

Formally, we define a random variable as a function mapping each outcome in a probability
space to a real number.

Definition 2.1.1 (Random variable). Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable
$X$ is a function from the sample space $\Omega$ to the real numbers $\mathbb{R}$. Once the outcome $\omega \in \Omega$ of
the experiment is revealed, the corresponding $X(\omega)$ is known as a realization of the random
variable.

Remark 2.1.2 (Rigorous definition). If we want to be completely rigorous, Definition 2.1.1 is
missing some details. Consider two sample spaces $\Omega_1$ and $\Omega_2$, and a $\sigma$-algebra $\mathcal{F}_2$ of sets in $\Omega_2$.
Then, for $X$ to be a random variable, there must exist a $\sigma$-algebra $\mathcal{F}_1$ in $\Omega_1$ such that for any
set $S$ in $\mathcal{F}_2$ the inverse image of $S$, defined by

\begin{equation}
X^{-1}(S) := \{\omega \mid X(\omega) \in S\}, \tag{2.1}
\end{equation}

belongs to $\mathcal{F}_1$. Usually, we take $\Omega_2$ to be the reals $\mathbb{R}$ and $\mathcal{F}_2$ to be the Borel $\sigma$-algebra, which is
defined as the smallest $\sigma$-algebra defined on the reals that contains all open intervals (amazingly,
it is possible to construct sets of real numbers that do not belong to this $\sigma$-algebra). In any case,
for the purpose of these notes, Definition 2.1.1 is sufficient (more information about the formal
foundations of probability can be found in any book on measure theory and advanced probability
theory).

Remark 2.1.3 (Notation). We often denote events of the form

\begin{equation}
\{X(\omega) \in S : \omega \in \Omega\} \tag{2.2}
\end{equation}

for some random variable $X$ and some set $S$ as

\begin{equation}
\{X \in S\} \tag{2.3}
\end{equation}

to alleviate notation, since the underlying probability space is often of no significance once we
have specified the random variables of interest.

A random variable quantifies our uncertainty about the quantity it represents, not the value 
that it happens to finally take once the outcome is revealed. You should never think of a random 
variable as having a fixed numerical value. If the outcome is known, then that determines a 
realization of the random variable. In order to stress the difference between random variables 
and their realizations, we denote the former with uppercase letters ( X , Y, ...) and the latter 
with lowercase letters (x, y, ...). 

If we have access to the probability space $(\Omega, \mathcal{F}, P)$ in which the random variable is defined, then
it is straightforward to compute the probability of a random variable $X$ belonging to a certain
set $S$:¹ it is the probability of the event that comprises all outcomes in $\Omega$ which $X$ maps to $S$,

\begin{equation}
P(X \in S) = P(\{\omega \mid X(\omega) \in S\}). \tag{2.4}
\end{equation}

However, we almost never model the probability space directly, since this requires estimating the 
probability of every possible event in the corresponding $\sigma$-algebra. As we explain in Sections 2.2
and 2.3, there are more practical methods to specify random variables, which automatically 
imply that a valid underlying probability space exists. The existence of this probability space 
ensures that the whole framework is mathematically sound, but you don’t really have to worry 
about it. 

2.2 Discrete random variables 

Discrete random variables take values on a finite or countably infinite subset of $\mathbb{R}$ such as the
integers. They are used to model discrete numerical quantities: the outcome of the roll of a die, 
the score in a basketball game, etc. 

2.2.1 Probability mass function 

To specify a discrete random variable it is enough to determine the probability of each value 
that it can take. In contrast to the case of continuous random variables, this is tractable because 
these values are countable by definition. 

Definition 2.2.1 (Probability mass function). Let $(\Omega, \mathcal{F}, P)$ be a probability space and
$X : \Omega \to \mathbb{Z}$ a random variable. The probability mass function (pmf) of $X$ is defined as

\begin{equation}
p_X(x) := P(\{\omega \mid X(\omega) = x\}). \tag{2.5}
\end{equation}

In words, $p_X(x)$ is the probability that $X$ equals $x$.

We usually say that a random variable is distributed according to a certain pmf. 

If the discrete range of $X$ is denoted by $D$, then the triplet $(D, 2^D, p_X)$ is a valid probability
space (recall that $2^D$ is the power set of $D$). In particular, $p_X$ is a valid probability measure
which satisfies

\begin{align}
p_X(x) &\geq 0 \quad \text{for any } x \in D, \tag{2.6} \\
\sum_{x \in D} p_X(x) &= 1. \tag{2.7}
\end{align}

¹ Strictly speaking, $S$ needs to belong to the Borel $\sigma$-algebra. Again, this comprises essentially any subset of the
reals that you will ever encounter in probabilistic modeling.

Figure 2.1: Probability mass function of the random variable $X$ in Example 2.2.2.

The converse is also true: if a function defined on a countable subset $D$ of the reals is nonnegative
and adds up to one, then it may be interpreted as the pmf of a random variable. In fact, in
practice we usually define discrete random variables by just specifying their pmf.

To compute the probability that a random variable $X$ is in a certain set $S$ we take the sum of
the pmf over all the values contained in $S$:

\begin{equation}
P(X \in S) = \sum_{x \in S} p_X(x). \tag{2.8}
\end{equation}


Example 2.2.2 (Discrete random variable). Figure 2.1 shows the probability mass function of
a discrete random variable $X$ (check that it adds up to one). To compute the probability of $X$
belonging to different sets we apply (2.8):

\begin{align}
P(X \in \{1, 4\}) &= p_X(1) + p_X(4) = 0.5, \tag{2.9} \\
P(X > 3) &= p_X(4) + p_X(5) = 0.6. \tag{2.10}
\end{align}

△
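
Formula (2.8) translates directly into code. In the sketch below the pmf is stored as a dictionary and the probability of a set is the sum of the pmf over its elements. The pmf values are hypothetical: they are not read off Figure 2.1, and were only chosen so that they add up to one and reproduce the two results (2.9) and (2.10). The function names are assumptions as well.

```python
from fractions import Fraction

# Hypothetical pmf on {1, ..., 5}; NOT the values plotted in Figure 2.1.
pmf = {
    1: Fraction(3, 10),
    2: Fraction(1, 20),
    3: Fraction(1, 20),
    4: Fraction(1, 5),
    5: Fraction(2, 5),
}


def is_valid_pmf(pmf):
    """Check conditions (2.6) and (2.7): nonnegative values that add up to one."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1


def prob_in(pmf, values):
    """P(X in S) as the sum of the pmf over the values contained in S, as in (2.8)."""
    return sum((p for x, p in pmf.items() if x in values), Fraction(0))


print(is_valid_pmf(pmf))                           # True
print(prob_in(pmf, {1, 4}))                        # 1/2
print(prob_in(pmf, {x for x in pmf if x > 3}))     # 3/5
```

Exact fractions make the check that the pmf adds up to one reliable; with floating-point values a small tolerance would be needed instead.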


2.2.2 Important discrete random variables 

In this section we describe several discrete random variables that are useful for probabilistic 
modeling. 
Bernoulli

Bernoulli random variables are used to model experiments that have two possible outcomes.
By convention we usually represent one outcome by 0 and the other outcome by 1. A canonical
example is flipping a biased coin, such that the probability of obtaining heads is $p$. If we encode
heads as 1 and tails as 0, then the result of the coin flip corresponds to a Bernoulli random
variable with parameter $p$.

Definition 2.2.3 (Bernoulli). The pmf of a Bernoulli random variable with parameter $p \in [0, 1]$
is given by

\begin{align}
p_X(0) &= 1 - p, \tag{2.11} \\
p_X(1) &= p. \tag{2.12}
\end{align}


A special kind of Bernoulli random variable is the indicator random variable of an event. This 
random variable is particularly useful in proofs. 

Definition 2.2.4 (Indicator). Let $(\Omega, \mathcal{F}, P)$ be a probability space. The indicator random
variable of an event $S \in \mathcal{F}$ is defined as

\begin{equation}
1_S(\omega) = \begin{cases} 1, & \text{if } \omega \in S, \\ 0, & \text{otherwise.} \end{cases} \tag{2.13}
\end{equation}

By definition the distribution of an indicator random variable is Bernoulli with parameter $P(S)$.


Geometric 

Imagine that we take a biased coin and flip it until we obtain heads. If the probability of 
obtaining heads is p and the flips are independent then the probability of having to flip k times 
is 


\begin{align}
P(k \text{ flips}) &= P(\text{1st flip} = \text{tails}, \ldots, (k-1)\text{th flip} = \text{tails}, k\text{th flip} = \text{heads}) \tag{2.14} \\
&= P(\text{1st flip} = \text{tails}) \cdots P((k-1)\text{th flip} = \text{tails}) \, P(k\text{th flip} = \text{heads}) \tag{2.15} \\
&= (1 - p)^{k-1} p. \tag{2.16}
\end{align}

This reasoning can be applied to any situation in which a random experiment with a fixed
probability $p$ is repeated until a particular outcome occurs, as long as the independence assumption
is met. In such cases the number of repetitions is modeled as a geometric random variable.

Definition 2.2.5 (Geometric). The pmf of a geometric random variable with parameter $p$ is
given by

\begin{equation}
p_X(k) = (1 - p)^{k-1} p, \quad k = 1, 2, \ldots \tag{2.17}
\end{equation}

Figure 2.2 shows the probability mass function of geometric random variables with different 
parameters. The larger p is, the more the distribution concentrates around smaller values of k. 
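
As a quick numerical illustration of Definition 2.2.5, the following sketch evaluates the geometric pmf for the three parameters shown in Figure 2.2 and checks that the probabilities add up to (essentially) one. The function name geometric_pmf and the truncation of the infinite sum at k = 200 are assumptions made for this example.

```python
def geometric_pmf(k, p):
    """pmf of a geometric random variable: (1 - p)^(k - 1) p for k = 1, 2, ..."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p


for p in (0.2, 0.5, 0.8):
    # Truncated sum; the neglected tail equals (1 - p)^200, which is negligible here.
    total_mass = sum(geometric_pmf(k, p) for k in range(1, 201))
    # Probability that at most 3 flips are needed.
    first_three = sum(geometric_pmf(k, p) for k in range(1, 4))
    print(p, round(total_mass, 6), round(first_three, 3))
```

The probability of needing at most three flips equals $1 - (1 - p)^3$, which grows quickly with $p$ and reflects the concentration around small values of $k$.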


Figure 2.2: Probability mass function of three geometric random variables with different parameters
(panels: p = 0.2, p = 0.5, p = 0.8).


Binomial 

Binomial random variables are extremely useful in probabilistic modeling. They are used to 
model the number of positive outcomes of n trials modeled as independent Bernoulli random 
variables with the same parameter. The following example illustrates this with coin flips. 

Example 2.2.6 (Coin flips). If we flip a biased coin n times, what is the probability that we 
obtain exactly k heads if the flips are independent and the probability of heads is pi 

Let us first consider a simpler problem: what is the probability of first obtaining k heads and 
then n — k tails? By independence, the answer is 

P (k heads, then n — k tails) (2-18) 

= P (1st flip = heads, ..., kth flip = heads, k + 1th flip = tails,..., nth flip = tails) 

= P (1st flip = heads) • • • P (kth flip = heads) P (k + 1th flip = tails) • • • P (nth flip = tails) 

= p k (l-p) n ~ k . (2.19) 

Note that the same reasoning implies that this is also the probability of obtaining exactly k
heads in any fixed order. The event of obtaining exactly k heads is the union of all of these
events. Because these events are disjoint (we cannot obtain exactly k heads in two different
orders simultaneously) we can add their individual probabilities to compute the probability of our
event of interest. We just need to know the number of possible orderings. By basic combinatorics,
this is given by the binomial coefficient $\binom{n}{k}$, defined as

\[ \binom{n}{k} := \frac{n!}{k! \, (n-k)!}. \tag{2.20} \]


We conclude that 

\[ P(k \text{ heads out of } n \text{ flips}) = \binom{n}{k} p^k (1-p)^{n-k}. \tag{2.21} \]

△


The random variable representing the number of heads in the example is called a binomial 
random variable. 
Figure 2.3: Probability mass function of three binomial random variables with different values of p
(panels: p = 0.2, p = 0.5, p = 0.8) and n = 20.

Figure 2.4: Probability mass function of three Poisson random variables with different parameters.


Definition 2.2.7 (Binomial). The pmf of a binomial random variable with parameters n and p 
is given by 

\[ p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n. \tag{2.22} \]

Figure 2.3 shows the probability mass function of binomial random variables with different values 
of p. 
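The binomial pmf is easy to evaluate directly. The sketch below uses `math.comb` from the Python standard library for the binomial coefficient; the function name `binomial_pmf` is an illustrative choice, and n = 20 mirrors the setting of Figure 2.3.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial random variable with parameters n and p."""
    if k < 0 or k > n:
        return 0.0
    # Number of orderings times the probability of each fixed ordering.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 20
for p in (0.2, 0.5, 0.8):
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mode = max(range(n + 1), key=lambda k: pmf[k])
    print(f"p = {p}: most likely number of heads = {mode}, total mass = {sum(pmf):.6f}")
```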

Poisson 

We motivate the definition of the Poisson random variable using an example. 

Example 2.2.8 (Call center). A call center wants to model the number of calls they receive 
over a day in order to decide how many people to hire. They make the following assumptions: 

1. Each call occurs independently from every other call. 

2. A given call has the same probability of occurring at any given time of the day. 

3. Calls occur at a rate of $\lambda$ calls per day.

In Chapter 5, we will see that these assumptions define a Poisson process.

Our aim is to compute the probability of receiving exactly k calls during a given day. To do this
we discretize the day into n intervals, compute the desired probability assuming each interval is
very small, and then let $n \to \infty$.

The probability that a call occurs in an interval of length 1/n is $\lambda/n$ by Assumptions 2 and 3.
The probability that m > 1 calls occur is $(\lambda/n)^m$. If n is very large this probability is negligible
compared to the probability that either one or zero calls are received in the interval; in fact it
tends to zero when we take the limit $n \to \infty$. The total number of calls occurring over the whole
day can consequently be approximated by the number of intervals in which a call occurs, as
long as n is large enough. Since a call occurs in each interval with the same probability and
calls happen independently, the number of calls over a whole day can be modeled as a binomial
random variable with parameters n and $p := \lambda/n$.

We now compute the distribution of calls when the intervals are arbitrarily small, i.e. when
$n \to \infty$:


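\begin{align}
P(k \text{ calls during the day}) &= \lim_{n \to \infty} P(k \text{ calls in } n \text{ small intervals}) \tag{2.23} \\
&= \lim_{n \to \infty} \binom{n}{k} p^k (1-p)^{n-k} \tag{2.24} \\
&= \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} \tag{2.25} \\
&= \lim_{n \to \infty} \frac{n! \, \lambda^k}{k! \, (n-k)! \, (n-\lambda)^k} \left(1 - \frac{\lambda}{n}\right)^n \tag{2.26} \\
&= \frac{\lambda^k e^{-\lambda}}{k!}. \tag{2.27}
\end{align}

The last step follows from the following lemma proved in Section 2.7.1.

Lemma 2.2.9.

\[ \lim_{n \to \infty} \frac{n!}{(n-k)! \, (n-\lambda)^k} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}. \tag{2.28} \]

△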


Random variables with the pmf that we have derived in the example are called Poisson random 
variables. They are used to model situations where something happens from time to time at a 
constant rate: packets arriving at an Internet router, earthquakes, traffic accidents, etc. The 
number of such events that occur over a fixed interval follows a Poisson distribution, as long as 
the assumptions we listed in the example hold. 

Definition 2.2.10 (Poisson). The pmf of a Poisson random variable with parameter $\lambda$ is given
by

\[ p_X(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots \tag{2.29} \]
Figure 2.5: Convergence of the binomial pmf with $p = \lambda/n$ to a Poisson pmf of parameter $\lambda$ as n grows
(panels: binomial with n = 40, n = 80 and n = 400, and Poisson with $\lambda = 20$).


Figure 2.4 shows the probability mass function of Poisson random variables with different values
of $\lambda$. In Example 2.2.8 we prove that as $n \to \infty$ the pmf of a binomial random variable with
parameters n and $\lambda/n$ tends to the pmf of a Poisson with parameter $\lambda$ (as we will see later in
the course, this is an example of convergence in distribution). Figure 2.5 shows an example of 
this phenomenon numerically; the convergence is quite fast. 

You might feel a bit skeptical about Example 2.2.8: the probability of receiving a call surely 
changes over the day and it must be different on weekends! That is true, but the model is 
actually very useful if we restrict our attention to shorter periods of time. In Figure 2.6 we show 
the result of modeling the number of calls received by a call center in Israel² over an interval of
four hours (8 pm to midnight) using a Poisson random variable. We plot the histogram of the 
number of calls received during that interval for two months (September and October of 1999) 
together with a Poisson pmf fitted to the data (we will learn how to fit distributions to data 
later on in the course). Despite the fact that our assumptions do not hold exactly, the model
produces a reasonably good fit.

2 The data is available here.

Figure 2.6: In blue, we see the histogram of the number of calls received during an interval of four
hours over two months at a call center in Israel. A Poisson pmf approximating the distribution of the
data is plotted in orange.

2.3 Continuous random variables 

Physical quantities are often best described as continuous: temperature, duration, speed, weight, 
etc. In order to model such quantities probabilistically we could discretize their domain and 
represent them as discrete random variables. However, we may not want our conclusions to 
depend on how we choose the discretization grid. Constructing a continuous model allows us to 
obtain insights that are valid for sufficiently fine grids without worrying about discretization. 

Precisely because continuous domains model the limit when discrete outcomes have an arbitrarily
fine granularity, we cannot characterize the probabilistic behavior of a continuous random
variable by just setting values for the probability of X being equal to individual outcomes, as we
do for discrete random variables. In fact, we cannot assign nonzero probabilities to specific
outcomes of an uncertain continuous quantity. This would result in uncountable disjoint outcomes
with nonzero probability. The sum of an uncountable number of positive values is infinite, so
the probability of their union would be greater than one, which does not make sense.

More rigorously, it turns out that we cannot define a valid probability measure on the power set
of $\mathbb{R}$ (justifying this requires measure theory and is beyond the scope of these notes). Instead,
we consider events that are composed of unions of intervals of $\mathbb{R}$. Such events form a $\sigma$-algebra
called the Borel $\sigma$-algebra. This $\sigma$-algebra is granular enough to represent any set that you
might be interested in (try thinking of a set that cannot be expressed as a countable union of
intervals), while allowing for valid probability measures to be defined on it.

2.3.1 Cumulative distribution function

To specify a random variable on the Borel $\sigma$-algebra it suffices to determine the probability of
the random variable belonging to all intervals of the form $(-\infty, x]$ for any $x \in \mathbb{R}$.

Definition 2.3.1 (Cumulative distribution function). Let $(\Omega, \mathcal{F}, P)$ be a probability space and
$X : \Omega \to \mathbb{R}$ a random variable. The cumulative distribution function (cdf) of X is defined as

\[ F_X(x) := P(X \le x). \tag{2.30} \]

In words, $F_X(x)$ is the probability of X being smaller than or equal to x.

Note that the cumulative distribution function can be defined for both continuous and discrete 
random variables. 

The following lemma describes some basic properties of the cdf. You can find the proof in 
Section 2.7.2. 

Lemma 2.3.2 (Properties of the cdf). For any continuous random variable X

\begin{align}
& \lim_{x \to -\infty} F_X(x) = 0, \tag{2.31} \\
& \lim_{x \to \infty} F_X(x) = 1, \tag{2.32} \\
& F_X(b) \ge F_X(a) \text{ if } b > a, \text{ i.e. } F_X \text{ is nondecreasing.} \tag{2.33}
\end{align}

To see why the cdf completely determines a random variable recall that we are only considering 
sets that can be expressed as unions of intervals. The probability of a random variable X 
belonging to an interval (a, b] is given by 

\begin{align}
P(a < X \le b) &= P(X \le b) - P(X \le a) \tag{2.34} \\
&= F_X(b) - F_X(a). \tag{2.35}
\end{align}


Remark 2.3.3. Since individual points have zero probability, for any continuous random
variable X

\[ P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b). \tag{2.36} \]


Now, to find the probability of X belonging to any particular set, we only need to decompose it 
into disjoint intervals and apply (2.35), as illustrated by the following example. 


Example 2.3.4 (Continuous random variable). Consider a continuous random variable X with
a cdf given by

\[
F_X(x) := \begin{cases}
0 & \text{for } x < 0, \\
0.5 \, x & \text{for } 0 \le x \le 1, \\
0.5 & \text{for } 1 \le x \le 2, \\
0.5 \left(1 + (x-2)^2\right) & \text{for } 2 \le x \le 3, \\
1 & \text{for } x > 3.
\end{cases} \tag{2.37}
\]

Figure 2.7: Cumulative distribution function of the random variable in Examples 2.3.4 and 2.3.7.


Figure 2.7 shows the cdf on the left image. You can check that it satisfies the properties in 
Lemma 2.3.2. To determine the probability that X is between 0.5 and 2.5, we apply (2.35), 

\[ P(0.5 < X \le 2.5) = F_X(2.5) - F_X(0.5) = 0.375, \tag{2.38} \]

as illustrated in Figure 2.7. △
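The computation in Example 2.3.4 can be reproduced in a few lines of Python. This is a minimal sketch: the function name `cdf` is an illustrative choice, and the monotonicity check on a grid is only a spot check of property (2.33), not a proof.

```python
def cdf(x: float) -> float:
    """Piecewise cdf from (2.37)."""
    if x < 0:
        return 0.0
    if x <= 1:
        return 0.5 * x
    if x <= 2:
        return 0.5
    if x <= 3:
        return 0.5 * (1 + (x - 2) ** 2)
    return 1.0


# P(0.5 < X <= 2.5) = F_X(2.5) - F_X(0.5), as in (2.35).
prob = cdf(2.5) - cdf(0.5)
print(prob)

# Spot check that the cdf is nondecreasing on a grid covering [-1, 4].
grid = [-1 + 0.01 * i for i in range(501)]
is_nondecreasing = all(cdf(a) <= cdf(b) for a, b in zip(grid, grid[1:]))
print(is_nondecreasing)
```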


2.3.2 Probability density function 

If the cdf of a continuous random variable is differentiable, its derivative can be interpreted as 
a density function. This density can then be integrated to obtain the probability of the random 
variable belonging to an interval or a union of intervals (and hence to any Borel set). 

Definition 2.3.5 (Probability density function). Let $X : \Omega \to \mathbb{R}$ be a random variable with cdf
$F_X$. If $F_X$ is differentiable then the probability density function or pdf of X is defined as

\[ f_X(x) := \frac{\mathrm{d} F_X(x)}{\mathrm{d} x}. \tag{2.39} \]


Intuitively, $f_X(x) \Delta$ is the probability of X belonging to an interval of width $\Delta$ around x as
$\Delta \to 0$. By the fundamental theorem of calculus, the probability of a random variable X
belonging to an interval is given by

\begin{align}
P(a < X \le b) &= F_X(b) - F_X(a) \tag{2.40} \\
&= \int_a^b f_X(x) \, \mathrm{d}x. \tag{2.41}
\end{align}


Our sets of interest belong to the Borel $\sigma$-algebra, and hence can be decomposed into unions of
intervals, so we can obtain the probability of X belonging to any such set S by integrating its
pdf over S

\[ P(X \in S) = \int_S f_X(x) \, \mathrm{d}x. \tag{2.42} \]

Figure 2.8: Probability density function of the random variable in Examples 2.3.4 and 2.3.7.


In particular, since X belongs to $\mathbb{R}$ by definition

\[ \int_{-\infty}^{\infty} f_X(x) \, \mathrm{d}x = P(X \in \mathbb{R}) = 1. \tag{2.43} \]

It follows from the monotonicity of the cdf (2.33) that the pdf is nonnegative

\[ f_X(x) \ge 0, \tag{2.44} \]

since otherwise we would be able to find two points $x_1 < x_2$ for which $F_X(x_2) < F_X(x_1)$.


Remark 2.3.6 (The pdf is not a probability measure). The pdf is a density which must be 
integrated to yield a probability. In particular, it is not necessarily smaller than one (for example, 
take a = 0 and b = 1/2 in Definition 2.3.8 below).


Finally, just as in the case of discrete random variables, we often say that a random variable is 
distributed according to a certain pdf or cdf, or that we know its distribution. The reason is 
that the pmf, pdf or cdf suffice to characterize the underlying probability space. 


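Example 2.3.7 (Continuous random variable (continued)). To compute the pdf of the random
variable in Example 2.3.4 we differentiate its cdf, to obtain

\[
f_X(x) = \begin{cases}
0 & \text{for } x < 0, \\
0.5 & \text{for } 0 \le x \le 1, \\
0 & \text{for } 1 \le x \le 2, \\
x - 2 & \text{for } 2 \le x \le 3, \\
0 & \text{for } x > 3.
\end{cases} \tag{2.45}
\]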


Figure 2.8 shows the pdf. You can check that it integrates to one. To determine the probability 
that X is between 0.5 and 2.5, we can just integrate over that interval to obtain the same answer 
as in Example 2.3.4, 


\begin{align}
P(0.5 < X \le 2.5) &= \int_{0.5}^{2.5} f_X(x) \, \mathrm{d}x \tag{2.46} \\
&= \int_{0.5}^{1} 0.5 \, \mathrm{d}x + \int_{2}^{2.5} (x - 2) \, \mathrm{d}x = 0.375. \tag{2.47}
\end{align}
Figure 2.9: Probability density function (left) and cumulative distribution function (right) of a uniform
random variable X.

Figure 2.8 illustrates that the probability of an event is equal to the area under the pdf once we
restrict it to the corresponding subset of the real line.

△
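The two checks suggested in Example 2.3.7 (the pdf integrates to one, and the interval probability equals 0.375) can be approximated numerically. The sketch below uses a simple midpoint rule written from scratch; the helper `integrate`, the integration range [-1, 4] and the number of steps are arbitrary illustrative choices, not a method prescribed by the notes.

```python
def pdf(x: float) -> float:
    """Piecewise pdf from (2.45)."""
    if 0 <= x <= 1:
        return 0.5
    if 2 <= x <= 3:
        return x - 2
    return 0.0


def integrate(f, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


total = integrate(pdf, -1.0, 4.0)  # the pdf vanishes outside [0, 3]
prob = integrate(pdf, 0.5, 2.5)    # area under the pdf between 0.5 and 2.5
print(round(total, 4), round(prob, 4))
```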


2.3.3 Important continuous random variables 

In this section we describe several continuous random variables that are useful in probabilistic 
modeling and statistics. 

Uniform 

A uniform random variable models an experiment in which every outcome within a continuous 
interval is equally likely. As a result the pdf is constant over the interval. Figure 2.9 shows the 
pdf and cdf of a uniform random variable. 

Definition 2.3.8 (Uniform). The pdf of a uniform random variable with domain $[a, b]$, where
b > a are real numbers, is given by

\[
f_X(x) = \begin{cases}
\frac{1}{b-a} & \text{if } a \le x \le b, \\
0 & \text{otherwise.}
\end{cases} \tag{2.48}
\]


Exponential 

Exponential random variables are often used to model the time that passes until a certain event 
occurs. Examples include decaying radioactive particles, telephone calls, earthquakes and many 
others. 


Definition 2.3.9 (Exponential). The pdf of an exponential random variable with parameter $\lambda$
is given by

\[
f_X(x) = \begin{cases}
\lambda e^{-\lambda x} & \text{if } x \ge 0, \\
0 & \text{otherwise.}
\end{cases} \tag{2.49}
\]

Figure 2.10: Probability density functions of exponential random variables with different parameters.


Figure 2.10 shows the pdf of three exponential random variables with different parameters. In
order to illustrate the potential of exponential distributions for modeling real data, in
Figure 2.11 we plot the histogram of inter-arrival times of calls at the same call center in Israel 
we mentioned earlier. In more detail, these inter-arrival times are the times between consecutive 
calls occurring between 8 pm and midnight over two days in September 1999. An exponential 
model fits the data quite well. 

An important property of an exponential random variable is that it is memoryless. We elaborate 
on this property, which is shared by the geometric distribution, in Section 2.4. 

Gaussian or Normal 

The Gaussian or normal random variable is arguably the most popular random variable in all 
of probability and statistics. It is often used to model variables with unknown distributions in 
the natural sciences. This is motivated by the fact that sums of independent random variables 
often converge to Gaussian distributions. This phenomenon is captured by the Central Limit 
Theorem, which we discuss in Chapter 6. 

Definition 2.3.10 (Gaussian). The pdf of a Gaussian or normal random variable with mean $\mu$
and standard deviation $\sigma$ is given by

\[ f_X(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{2.50} \]

A Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ is usually denoted by $\mathcal{N}(\mu, \sigma^2)$.

We provide formal definitions of the mean and the standard deviation of a random variable in 
Chapter 4. For now, you can just think of them as quantities that parametrize the Gaussian 
pdf. 

It is not immediately obvious that the pdf of the Gaussian integrates to one. We establish this 
in the following lemma. 

Lemma 2.3.11 (Proof in Section 2.7.3). The pdf of a Gaussian random variable integrates to
one.

Figure 2.11: Histogram of inter-arrival times of calls at a call center in Israel (red) compared to its
approximation by an exponential pdf.

Figure 2.12: Gaussian random variable with different means and standard deviations.

Figure 2.13: Histogram of heights in a population of 25,000 people (blue) and its approximation using
a Gaussian distribution (orange). Horizontal axis: height (inches).


Figure 2.12 shows the pdfs of two Gaussian random variables with different values of $\mu$ and $\sigma$.
Figure 2.13 shows the histogram of the heights in a population of 25,000 people and how it is
very well approximated by a Gaussian random variable³.

An annoying feature of the Gaussian random variable is that its cdf does not have a closed form
solution, in contrast to the uniform and exponential random variables. This complicates the
task of determining the probability that a Gaussian random variable is in a certain interval. To
tackle this problem we use the fact that if X is a Gaussian random variable with mean $\mu$ and
standard deviation $\sigma$, then

\[ U := \frac{X - \mu}{\sigma} \tag{2.51} \]

is a standard Gaussian random variable, which means that its mean is zero and its standard
deviation equals one. See Lemma 2.5.1 for the proof. This allows us to express the probability
of X being in an interval $[a, b]$ in terms of the cdf of a standard Gaussian, which we denote by
$\Phi$,

\begin{align}
P(X \in [a, b]) &= P\left( \frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma} \right) \tag{2.52} \\
&= \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right). \tag{2.53}
\end{align}


As long as we can evaluate $\Phi$, this formula allows us to deal with arbitrary Gaussian random
variables. To evaluate $\Phi$ people used to resort to lists of tabulated values, compiled by computing
the corresponding integrals numerically. Nowadays you can just use Matlab, WolframAlpha, 
SciPy, etc. 
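For instance, formula (2.53) can be evaluated with nothing more than the Python standard library. The sketch below relies on the standard identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$, which is not stated in these notes; the function names and the values $\mu = 68$, $\sigma = 2$ are hypothetical and chosen only for illustration.

```python
from math import erf, sqrt


def std_normal_cdf(z: float) -> float:
    """Cdf of a standard Gaussian, expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def gaussian_interval_prob(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a <= X <= b) for a Gaussian X with mean mu and standard deviation sigma."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)


mu, sigma = 68.0, 2.0  # hypothetical parameters
within_one_sigma = gaussian_interval_prob(mu - sigma, mu + sigma, mu, sigma)
print(round(within_one_sigma, 4))
```

Because of the standardization in (2.51), the result depends on a and b only through their distance from the mean measured in standard deviations.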

3 The data is available 
Figure 2.14: Pdfs of beta random variables with different values of the a and b parameters.


Beta 


Beta distributions allow us to parametrize unimodal continuous distributions supported on the 
unit interval. This is useful in Bayesian statistics, as we discuss in Chapter 10. 


Definition 2.3.12 (Beta distribution). The pdf of a beta distribution with parameters a and b
is defined as

\[
f_\beta(\theta; a, b) := \begin{cases}
\frac{\theta^{a-1} (1-\theta)^{b-1}}{\beta(a, b)} & \text{if } 0 \le \theta \le 1, \\
0 & \text{otherwise,}
\end{cases} \tag{2.54}
\]

where

\[ \beta(a, b) := \int_0^1 u^{a-1} (1-u)^{b-1} \, \mathrm{d}u. \tag{2.55} \]

$\beta(a, b)$ is a special function called the beta function or Euler integral of the first kind, which
must be computed numerically. The uniform distribution is an example of a beta distribution
(where a = 1 and b = 1). Figure 2.14 shows the pdf of several different beta distributions.


2.4 Conditioning on an event 

In Section 1.2 we explain how to modify the probability measure of a probability space to 
incorporate the assumption that a certain event has occurred. In this section, we review this 
situation when random variables are involved. In particular, we consider a random variable X 
with a certain distribution represented by a pmf, cdf or pdf and explain how its distribution 
changes if we assume that $X \in S$, for any set S belonging to the Borel $\sigma$-algebra (remember
that this includes essentially any useful set you can think of).

If X is discrete with pmf $p_X$, the conditional pmf of X given $X \in S$ is

\begin{align}
p_{X | X \in S}(x) &:= P(X = x \mid X \in S) \tag{2.56} \\
&= \begin{cases}
\frac{p_X(x)}{\sum_{s \in S} p_X(s)} & \text{if } x \in S, \\
0 & \text{otherwise.}
\end{cases} \tag{2.57}
\end{align}

This is a valid pmf in the new probability space restricted to the event $\{X \in S\}$.

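Similarly if X is continuous with pdf $f_X$, the conditional cdf of X given the event $X \in S$ is

\begin{align}
F_{X | X \in S}(x) &:= P(X \le x \mid X \in S) \tag{2.58} \\
&= \frac{P(X \le x, X \in S)}{P(X \in S)} \tag{2.59} \\
&= \frac{\int_{u \le x, \, u \in S} f_X(u) \, \mathrm{d}u}{\int_{u \in S} f_X(u) \, \mathrm{d}u}, \tag{2.60}
\end{align}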

again by the definition of conditional probability. One can check that this is a valid cdf in the 
new probability space. To obtain the conditional pdf we just differentiate this cdf, 


\[ f_{X | X \in S}(x) := \frac{\mathrm{d} F_{X | X \in S}(x)}{\mathrm{d} x}. \tag{2.61} \]

We now apply these ideas to show that the geometric and exponential random variables are
memoryless.


Example 2.4.1 (Geometric random variables are memoryless). We flip a coin repeatedly until 
we obtain heads, but pause after a couple of flips (which were tails). Let us assume that the 
flips are independent and have the same bias p (i.e. the probability of obtaining heads in every 
flip is p). What is the probability of obtaining heads in k more flips? Perhaps surprisingly, it is 
exactly the same as the probability of obtaining a heads after k flips from the beginning. 


To establish this rigorously we compute the conditional pmf of a geometric random variable X
conditioned on the event $\{X > k_0\}$ (i.e. the first $k_0$ flips were tails in our example). Applying
(2.56) we have

\begin{align}
p_{X | X > k_0}(k) &= \frac{p_X(k)}{\sum_{m=k_0+1}^{\infty} p_X(m)} \tag{2.62} \\
&= \frac{(1-p)^{k-1} p}{\sum_{m=k_0+1}^{\infty} (1-p)^{m-1} p} \tag{2.63} \\
&= (1-p)^{k-k_0-1} p \tag{2.64}
\end{align}

if $k > k_0$ and zero otherwise. We have used the fact that the geometric series

\[ \sum_{m=k_0+1}^{\infty} \alpha^m = \frac{\alpha^{k_0+1}}{1-\alpha} \tag{2.65} \]

for any $|\alpha| < 1$.


In the new probability space where the count starts at $k_0 + 1$ the conditional pmf is that of a
geometric random variable with the same parameter as the original one. The first $k_0$ flips don’t
affect the future, once it is revealed that they were tails.

△
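The memoryless property of the geometric distribution can be verified numerically by computing the conditional pmf directly from its definition and comparing it with the unconditional pmf shifted by $k_0$. The sketch below is an illustration under assumed values p = 0.3 and $k_0 = 5$; the function names and the truncation of the infinite sum at 10,000 terms are choices made here, not part of the notes.

```python
def geometric_pmf(k: int, p: float) -> float:
    return (1 - p) ** (k - 1) * p if k >= 1 else 0.0


def conditional_pmf(k: int, k0: int, p: float, terms: int = 10_000) -> float:
    """P(X = k | X > k0), with the normalizing sum truncated after `terms` terms."""
    if k <= k0:
        return 0.0
    tail = sum(geometric_pmf(m, p) for m in range(k0 + 1, k0 + 1 + terms))
    return geometric_pmf(k, p) / tail


p, k0 = 0.3, 5  # hypothetical values
for j in range(1, 6):
    # Conditional probability of needing j more flips vs. j flips from scratch.
    print(j, round(conditional_pmf(k0 + j, k0, p), 6), round(geometric_pmf(j, p), 6))
```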
Example 2.4.2 (Exponential random variables are memoryless). Let us assume that the
inter-arrival times of your emails follow an exponential distribution (over intervals of several hours
this is probably a good approximation, let us know if you check). You receive an email. The
time until you receive your next email is exponentially distributed with a certain parameter $\lambda$.
No email arrives in the next $t_0$ minutes. Surprisingly, the time from then until you receive your
next email is again exponentially distributed with the same parameter, no matter the value of
$t_0$. Just like geometric random variables, exponential random variables are memoryless.


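Let us prove this rigorously. We compute the conditional cdf of an exponential random variable T
with parameter $\lambda$ conditioned on the event $\{T > t_0\}$, for an arbitrary $t_0 > 0$, by applying (2.60)

\begin{align}
F_{T | T > t_0}(t) &= \frac{\int_{t_0}^{t} f_T(u) \, \mathrm{d}u}{\int_{t_0}^{\infty} f_T(u) \, \mathrm{d}u} \tag{2.66} \\
&= \frac{e^{-\lambda t} - e^{-\lambda t_0}}{-e^{-\lambda t_0}} \tag{2.67} \\
&= 1 - e^{-\lambda (t - t_0)}. \tag{2.68}
\end{align}

Differentiating with respect to t yields an exponential pdf $f_{T | T > t_0}(t) = \lambda e^{-\lambda (t - t_0)}$ starting at $t_0$.

△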
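The same property can be checked numerically for the exponential distribution: conditioned on having waited $t_0$, the probability of waiting at most s longer should equal the unconditional probability of waiting at most s. The following sketch uses hypothetical values $\lambda = 0.5$ and $t_0 = 3$; the function names are illustrative.

```python
from math import exp


def exp_cdf(t: float, lam: float) -> float:
    """Cdf of an exponential random variable with parameter lam."""
    return 1.0 - exp(-lam * t) if t >= 0 else 0.0


def conditional_cdf(t: float, t0: float, lam: float) -> float:
    """P(T <= t | T > t0), computed as P(t0 < T <= t) / P(T > t0)."""
    if t <= t0:
        return 0.0
    return (exp_cdf(t, lam) - exp_cdf(t0, lam)) / (1.0 - exp_cdf(t0, lam))


lam, t0 = 0.5, 3.0  # hypothetical values
for s in (0.5, 1.0, 2.0, 5.0):
    print(s, round(conditional_cdf(t0 + s, t0, lam), 6), round(exp_cdf(s, lam), 6))
```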


2.5 Functions of random variables 


Computing the distribution of a function of a random variable is often very useful in probabilistic 
modeling. For example, if we model the current in a circuit using a random variable X, we might 
be interested in the power $Y := rX^2$ dissipated across a resistor with deterministic resistance
r. If we apply a deterministic function $g : \mathbb{R} \to \mathbb{R}$ to a random variable X, then the result
$Y := g(X)$ is not a deterministic quantity. Recall that random variables are functions from a
sample space $\Omega$ to $\mathbb{R}$. If X maps elements of $\Omega$ to $\mathbb{R}$, then so does Y since $Y(\omega) = g(X(\omega))$.
This means that Y is also a random variable. In this section we explain how to characterize the 
distribution of Y when the distribution of X is known. 

If X is discrete, then it is straightforward to compute the pmf of g ( X ) from the pmf of X , 

\begin{align}
p_Y(y) &= P(Y = y) \tag{2.69} \\
&= P(g(X) = y) \tag{2.70} \\
&= \sum_{\{x \mid g(x) = y\}} p_X(x). \tag{2.71}
\end{align}
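Formula (2.71) translates directly into code: the pmf of Y = g(X) is obtained by adding up the probabilities of all values x that g maps to the same y. The sketch below represents a pmf as a Python dictionary; the function name `pmf_of_function` and the example pmf are hypothetical.

```python
from collections import defaultdict


def pmf_of_function(p_x: dict, g) -> dict:
    """Pmf of Y = g(X): sum p_X(x) over all x such that g(x) = y."""
    p_y = defaultdict(float)
    for x, prob in p_x.items():
        p_y[g(x)] += prob
    return dict(p_y)


# Hypothetical pmf of X, symmetric around zero.
p_x = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}
p_y = pmf_of_function(p_x, lambda x: x**2)
print(p_y)
```

Here the values -1 and 1 are both mapped to 1, so their probabilities are merged, and similarly for -2 and 2.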


If X is continuous, the procedure is more subtle. We first compute the cdf of Y by applying the 
definition, 


\begin{align}
F_Y(y) &= P(Y \le y) \tag{2.72} \\
&= P(g(X) \le y) \tag{2.73} \\
&= \int_{\{x \mid g(x) \le y\}} f_X(x) \, \mathrm{d}x, \tag{2.74}
\end{align}


where the last equality obviously only holds if X has a pdf. We can then obtain the pdf of Y 
from its cdf if it is differentiable. This idea can be used to prove a useful result about Gaussian 
random variables. 






Lemma 2.5.1 (Gaussian random variable). If X is a Gaussian random variable with mean $\mu$
and standard deviation $\sigma$, then

\[ U := \frac{X - \mu}{\sigma} \tag{2.75} \]

is a standard Gaussian random variable.


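Proof. We apply (2.74) to obtain

\begin{align}
F_U(u) &= P\left( \frac{X - \mu}{\sigma} \le u \right) \tag{2.76} \\
&= \int_{(x - \mu)/\sigma \le u} \frac{1}{\sqrt{2\pi} \, \sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, \mathrm{d}x \tag{2.77} \\
&= \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{w^2}{2}} \, \mathrm{d}w \quad \text{by the change of variables } w = \frac{x - \mu}{\sigma}. \tag{2.78}
\end{align}

Differentiating with respect to u yields

\[ f_U(u) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{u^2}{2}}, \tag{2.79} \]

so U is indeed a standard Gaussian random variable. □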


2.6 Generating random variables 

Simulation is a fundamental tool in probabilistic modeling. Simulating the outcome of a model 
requires sampling from the random variables included in it. The most widespread strategy for 
generating samples from a random variable decouples the process into two steps: 

1. Generating samples uniformly from the unit interval [0,1]. 

2. Transforming the uniform samples so that they have the desired distribution. 

Here we focus on the second step, assuming that we have access to a random-number generator 
that produces independent samples following a uniform distribution in [0,1]. The construction 
of good uniform random generators is an important problem, which is beyond the scope of these 
notes. 

2.6.1 Sampling from a discrete distribution 

Let X be a discrete random variable with pmf $p_X$ and U a uniform random variable in [0,1].
Our aim is to transform a sample from U so that it is distributed according to $p_X$. We denote
the values that have nonzero probability under $p_X$ by $x_1, x_2, \ldots$

For a fixed i, assume that we assign all samples of U within an interval of length $p_X(x_i)$
to $x_i$. Then the probability that a given sample from U is assigned to $x_i$ is exactly $p_X(x_i)$!

Figure 2.15: Illustration of the method to generate samples from an arbitrary discrete distribution
described in Section 2.6.1. The cdf of a discrete random variable is shown in blue. The samples $u_4$ and
$u_2$ from a uniform distribution are mapped to $x_1$ and $x_3$ respectively, whereas $u_1$, $u_3$ and $u_5$ are mapped
to $x_2$.

Very conveniently, the unit interval can be partitioned into intervals of length $p_X(x_i)$. We can
consequently generate X by sampling from U and setting


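\[
X = \begin{cases}
x_1 & \text{if } 0 \le U \le p_X(x_1), \\
x_2 & \text{if } p_X(x_1) \le U \le p_X(x_1) + p_X(x_2), \\
\ \vdots \\
x_i & \text{if } \sum_{j=1}^{i-1} p_X(x_j) \le U \le \sum_{j=1}^{i} p_X(x_j), \\
\ \vdots
\end{cases} \tag{2.80}
\]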


Recall that the cdf of a discrete random variable equals 

\begin{align}
F_X(x) &= P(X \le x) \tag{2.81} \\
&= \sum_{x_i \le x} p_X(x_i), \tag{2.82}
\end{align}

so our algorithm boils down to obtaining a sample u from U and then outputting the $x_i$ such
that $F_X(x_{i-1}) \le u \le F_X(x_i)$. This is illustrated in Figure 2.15.
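The procedure of Section 2.6.1 can be sketched in Python as follows. The function scans the cumulative sums of the pmf and returns the first value whose cdf is at least u. The names, the three-point pmf and the use of Python's `random.Random` as the uniform generator are assumptions made for illustration.

```python
import random
from itertools import accumulate


def sample_discrete(values, probs, u: float):
    """Map a uniform sample u in [0, 1] to the first x_i with u <= F_X(x_i)."""
    for x, cumulative in zip(values, accumulate(probs)):
        if u <= cumulative:
            return x
    # Guard against rounding: the cumulative sum may fall marginally short of 1.
    return values[-1]


values = ["x1", "x2", "x3"]
probs = [0.2, 0.5, 0.3]  # hypothetical pmf

rng = random.Random(0)  # seeded so that the run is reproducible
samples = [sample_discrete(values, probs, rng.random()) for _ in range(100_000)]
freqs = {v: samples.count(v) / len(samples) for v in values}
print(freqs)
```

The empirical frequencies should be close to the pmf, up to sampling fluctuations.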


2.6.2 Inverse-transform sampling 

Inverse-transform sampling makes it possible to sample from an arbitrary distribution with a 
known cdf by applying a deterministic transformation to uniform samples. Intuitively, we can 
interpret it as a generalization of the method in Section 2.6.1 to continuous distributions. 

Algorithm 2.6.1 (Inverse-transform sampling). Let X be a continuous random variable with 
cdf F x and U a random variable that is uniformly distributed in [0,1] and independent of X. 


1. Obtain a sample u of U. 

2. Set $x := F_X^{-1}(u)$.

Figure 2.16: Samples from an exponential distribution with parameter $\lambda = 1$ obtained by
inverse-transform sampling as described in Example 2.6.4. The samples $u_1, \ldots, u_5$ are generated from a
uniform distribution.


The careful reader will point out that $F_X$ may not be invertible at every point. To avoid this
problem we define the generalized inverse of the cdf as

\[ F_X^{-1}(u) := \min_x \{ F_X(x) = u \}. \tag{2.83} \]

The function is well defined because all cdfs are non-decreasing, so $F_X$ is equal to a constant c
in any interval $[x_1, x_2]$ where it is not invertible.

We now prove that Algorithm 2.6.1 works. 


Theorem 2.6.2 (Inverse-transform sampling works). The distribution of $Y = F_X^{-1}(U)$ is the
same as the distribution of X.


Proof. We just need to show that the cdf of Y is equal to $F_X$. We have

\begin{align}
F_Y(y) &= P(Y \le y) \tag{2.84} \\
&= P\left( F_X^{-1}(U) \le y \right) \tag{2.85} \\
&= P(U \le F_X(y)) \tag{2.86} \\
&= \int_{u=0}^{F_X(y)} \mathrm{d}u \tag{2.87} \\
&= F_X(y), \tag{2.88}
\end{align}

where in step (2.86) we have to take into account that we are using the generalized inverse of
the cdf. This is resolved by the following lemma proved in Section 2.7.4.

Lemma 2.6.3. The events $\{ F_X^{-1}(U) \le y \}$ and $\{ U \le F_X(y) \}$ are equivalent.

□
Example 2.6.4 (Sampling from an exponential distribution). Let X be an exponential random
variable with parameter $\lambda$. Its cdf $F_X(x) := 1 - e^{-\lambda x}$ is invertible in $[0, \infty)$. Its inverse equals

\[ F_X^{-1}(u) = \frac{1}{\lambda} \log\left( \frac{1}{1 - u} \right). \tag{2.89} \]

$F_X^{-1}(U)$ is an exponential random variable with parameter $\lambda$ by Theorem 2.6.2. Figure 2.16
shows how the samples of U are transformed into samples of X.

△
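Inverse-transform sampling for the exponential distribution takes only a few lines of Python. The sketch below inverts the cdf $F_X(x) = 1 - e^{-\lambda x}$, which gives $\log(1/(1-u))/\lambda$, and compares the empirical cdf of the samples at x = 1 with the exact value. The function name, the seed and the sample size are illustrative assumptions.

```python
import math
import random


def sample_exponential(lam: float, u: float) -> float:
    """Apply the inverse cdf of an exponential with parameter lam to u in [0, 1)."""
    return math.log(1.0 / (1.0 - u)) / lam


lam = 1.0
rng = random.Random(0)  # rng.random() returns values in [0, 1), so 1 - u > 0
samples = [sample_exponential(lam, rng.random()) for _ in range(100_000)]

# Fraction of samples below 1, to be compared with F_X(1) = 1 - exp(-lam).
empirical_cdf_at_1 = sum(s <= 1.0 for s in samples) / len(samples)
print(round(empirical_cdf_at_1, 3), round(1.0 - math.exp(-lam), 3))
```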


2.7 Proofs 

2.7.1 Proof of Lemma 2.2.9 

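For any fixed constants $c_1$ and $c_2$

\[ \lim_{n \to \infty} \frac{n - c_1}{n - c_2} = 1, \tag{2.90} \]

so that

\[ \lim_{n \to \infty} \frac{n!}{(n-k)! \, (n-\lambda)^k} = \lim_{n \to \infty} \frac{n}{n-\lambda} \cdot \frac{n-1}{n-\lambda} \cdots \frac{n-k+1}{n-\lambda} = 1. \tag{2.91} \]

The result follows from the following basic calculus identity:

\[ \lim_{n \to \infty} \left( 1 - \frac{\lambda}{n} \right)^n = e^{-\lambda}. \tag{2.92} \]

2.7.2 Proof of Lemma 2.3.2

To establish (2.31)

\begin{align}
\lim_{x \to -\infty} F_X(x) &= 1 - \lim_{x \to -\infty} P(X > x) \tag{2.93} \\
&= 1 - P(X > 0) - \lim_{n \to \infty} \sum_{i=0}^{n} P(-i \ge X > -(i+1)) \tag{2.94} \\
&= 1 - P\left( \lim_{n \to \infty} \{X > 0\} \cup \bigcup_{i=0}^{n} \{-i \ge X > -(i+1)\} \right) \tag{2.95} \\
&= 1 - P(\Omega) = 0. \tag{2.96}
\end{align}

The proof of (2.32) follows from this result. Let $Y = -X$, then

\begin{align}
\lim_{x \to \infty} F_X(x) &= \lim_{x \to \infty} P(X \le x) \tag{2.97} \\
&= 1 - \lim_{x \to \infty} P(X > x) \tag{2.98} \\
&= 1 - \lim_{x \to -\infty} P(-X < x) \tag{2.99} \\
&= 1 - \lim_{x \to -\infty} F_Y(x) = 1 \quad \text{by (2.31).} \tag{2.100}
\end{align}

Finally, (2.33) holds because $\{X \le a\} \subseteq \{X \le b\}$.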
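2.7.3 Proof of Lemma 2.3.11

The result is a consequence of the following lemma.

Lemma 2.7.1.

\[ \int_{-\infty}^{\infty} e^{-t^2} \, \mathrm{d}t = \sqrt{\pi}. \tag{2.101} \]

Proof. Let us define

\[ I = \int_{-\infty}^{\infty} e^{-x^2} \, \mathrm{d}x. \tag{2.102} \]

Now taking the square and changing to polar coordinates,

\begin{align}
I^2 &= \int_{-\infty}^{\infty} e^{-x^2} \, \mathrm{d}x \int_{-\infty}^{\infty} e^{-y^2} \, \mathrm{d}y \tag{2.103} \\
&= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} e^{-(x^2 + y^2)} \, \mathrm{d}x \, \mathrm{d}y \tag{2.104} \\
&= \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} r e^{-r^2} \, \mathrm{d}r \, \mathrm{d}\theta \tag{2.105} \\
&= -\pi e^{-r^2} \Big]_0^{\infty} = \pi. \tag{2.106}
\end{align}

To complete the proof we use the change of variables $t = (x - \mu) / (\sqrt{2} \, \sigma)$.

□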


2.7.4 Proof of Lemma 2.6.3 

$\{ F_X^{-1}(U) \le y \}$ implies $\{ U \le F_X(y) \}$


Assume that $U > F_X(y)$, then for all x such that $F_X(x) = U$, $x > y$ because the cdf is
nondecreasing. In particular $\min_x \{ F_X(x) = U \} > y$.


$\{ U \le F_X(y) \}$ implies $\{ F_X^{-1}(U) \le y \}$


Assume that $\min_x \{ F_X(x) = U \} > y$, then $U > F_X(y)$ because the cdf is nondecreasing. The
inequality is strict because $U = F_X(y)$ would imply that y belongs to $\{ F_X(x) = U \}$, which
cannot be the case as we are assuming that it is smaller than the minimum of that set.




Chapter 3 


Multivariate Random Variables 


Probabilistic models usually include multiple uncertain numerical quantities. In this chapter we 
describe how to specify random variables to represent such quantities and their interactions. On
some occasions, it will make sense to group these random variables as random vectors, which
we write using uppercase letters with an arrow on top: $\vec{X}$. Realizations of these random vectors
are denoted with lowercase letters: $\vec{x}$.

3.1 Discrete random variables 

Recall that discrete random variables are numerical quantities that take either finite or countably 
infinite values. In this section we explain how to manipulate multiple discrete random variables 
that share a common probability space. 

3.1.1 Joint probability mass function 

If several discrete random variables are defined on the same probability space, we specify their 
probabilistic behavior through their joint probability mass function, which is the probability 
that each variable takes a particular value. 

Definition 3.1.1 (Joint probability mass function). Let $X : \Omega \to R_X$ and $Y : \Omega \to R_Y$ be
discrete random variables ($R_X$ and $R_Y$ are discrete sets) on the same probability space $(\Omega, \mathcal{F}, P)$.
The joint pmf of X and Y is defined as

$$p_{X,Y}(x, y) := P(X = x, Y = y). \tag{3.1}$$

In words, $p_{X,Y}(x, y)$ is the probability of X and Y being equal to x and y respectively.
Similarly, the joint pmf of a discrete random vector of dimension n

$$\vec{X} := \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \tag{3.2}$$

with entries $X_i : \Omega \to R_{X_i}$ ($R_{X_1}, \ldots, R_{X_n}$ are all discrete sets) belonging to the same probability
space is defined as

$$p_{\vec{X}}(\vec{x}) := P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n). \tag{3.3}$$


As in the case of the pmf of a single random variable, the joint pmf is a valid probability measure
if we consider a probability space where the sample space is $R_X \times R_Y$¹ (or $R_{X_1} \times R_{X_2} \times \cdots \times R_{X_n}$
in the case of a random vector) and the σ-algebra is just the power set of the sample space. This
implies that the joint pmf completely characterizes the random variables or the random vector;
we don't need to worry about the underlying probability space.

By the definition of probability measure, the joint pmf must be nonnegative and its sum over
all its possible arguments must equal one,

$$p_{X,Y}(x, y) \ge 0 \quad \text{for any } x \in R_X, \; y \in R_Y, \tag{3.4}$$

$$\sum_{x \in R_X} \sum_{y \in R_Y} p_{X,Y}(x, y) = 1. \tag{3.5}$$

By the Law of Total Probability, the joint pmf allows us to obtain the probability of X and Y
belonging to any set $S \subseteq R_X \times R_Y$,

\begin{align}
P((X, Y) \in S) &= P\left(\cup_{(x,y) \in S} \{X = x, Y = y\}\right) \quad \text{(union of disjoint events)} \tag{3.6} \\
&= \sum_{(x,y) \in S} P(X = x, Y = y) \tag{3.7} \\
&= \sum_{(x,y) \in S} p_{X,Y}(x, y). \tag{3.8}
\end{align}

These properties also hold for random vectors (and groups of more than two random variables).
For any random vector $\vec{X}$,

$$p_{\vec{X}}(\vec{x}) \ge 0, \tag{3.9}$$

$$\sum_{x_1 \in R_{X_1}} \sum_{x_2 \in R_{X_2}} \cdots \sum_{x_n \in R_{X_n}} p_{\vec{X}}(\vec{x}) = 1. \tag{3.10}$$

The probability that $\vec{X}$ belongs to a discrete set $S \subseteq \mathbb{R}^n$ is given by

$$P(\vec{X} \in S) = \sum_{\vec{x} \in S} p_{\vec{X}}(\vec{x}). \tag{3.11}$$


3.1.2 Marginalization 

Assume we have access to the joint pmf of several random variables in a certain probability 
space, but we are only interested in the behavior of one of them. To compute the value of its 
pmf for a particular value, we fix that value and sum over the remaining random variables. 


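Indeed, by the Law of Total Probability

\begin{align}
p_X(x) &= P(X = x) \tag{3.12} \\
&= P\left(\cup_{y \in R_Y} \{X = x, Y = y\}\right) \quad \text{(union of disjoint events)} \tag{3.13} \\
&= \sum_{y \in R_Y} P(X = x, Y = y) \tag{3.14} \\
&= \sum_{y \in R_Y} p_{X,Y}(x, y). \tag{3.15}
\end{align}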


¹ This is the Cartesian product of the two sets, defined in Section A.2, which contains all possible pairs (x, y)
where $x \in R_X$ and $y \in R_Y$.
When the joint pmf involves more than two random variables the argument is exactly the same.
This is called marginalizing over the other random variables. In this context, the pmf of a
single random variable is called its marginal pmf. Table 3.1 shows an example of a joint pmf
and the corresponding marginal pmfs.

If we are interested in computing the joint pmf of several entries in a random vector, instead
of just one, the marginalization process is essentially the same. The pmf is again obtained
by summing over the rest of the entries. Let $\mathcal{I} \subseteq \{1, 2, \ldots, n\}$ be a subset of $m < n$ entries
of an n-dimensional random vector $\vec{X}$ and $\vec{X}_{\mathcal{I}}$ the corresponding random subvector. To compute
the joint pmf of $\vec{X}_{\mathcal{I}}$ we sum over all the entries that are not in $\mathcal{I}$, which we denote by
$\{j_1, j_2, \ldots, j_{n-m}\} := \{1, 2, \ldots, n\} / \mathcal{I}$,

$$p_{\vec{X}_{\mathcal{I}}}(\vec{x}_{\mathcal{I}}) = \sum_{x_{j_1} \in R_{X_{j_1}}} \sum_{x_{j_2} \in R_{X_{j_2}}} \cdots \sum_{x_{j_{n-m}} \in R_{X_{j_{n-m}}}} p_{\vec{X}}(\vec{x}). \tag{3.16}$$



3.1.3 Conditional distributions 

Conditional probabilities allow us to update our uncertainty about the quantities in a probabilistic
model when new information is revealed. The conditional distribution of a random variable
specifies the behavior of the random variable when we assume that other random variables in
the probability space take a fixed value.

Definition 3.1.2 (Conditional probability mass function). The conditional probability mass
function of Y given X, where X and Y are discrete random variables defined on the same
probability space, is given by

\begin{align}
p_{Y|X}(y|x) &= P(Y = y \mid X = x) \tag{3.17} \\
&= \frac{p_{X,Y}(x, y)}{p_X(x)} \quad \text{if } p_X(x) > 0 \tag{3.18}
\end{align}

and is undefined otherwise.



The conditional pmf $p_{X|Y}(\cdot|y)$ characterizes our uncertainty about X conditioned on the event
$\{Y = y\}$. This object is a valid pmf of X, so that if $R_X$ is the range of X

$$\sum_{x \in R_X} p_{X|Y}(x|y) = 1 \tag{3.19}$$

for any y for which it is well defined. However, it is not a pmf for Y. In particular, there is no
reason for $\sum_{y \in R_Y} p_{X|Y}(x|y)$ to add up to one!

We now define the joint conditional pmf of several random variables (equivalently of a subvector 
of a random vector) given other random variables (or entries of the random vector). 

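Definition 3.1.3 (Conditional pmf). The conditional pmf of a discrete random subvector $\vec{X}_{\mathcal{I}}$,
$\mathcal{I} \subseteq \{1, 2, \ldots, n\}$, given another subvector $\vec{X}_{\mathcal{J}}$ is

$$p_{\vec{X}_{\mathcal{I}} | \vec{X}_{\mathcal{J}}}(\vec{x}_{\mathcal{I}} | \vec{x}_{\mathcal{J}}) := \frac{p_{\vec{X}}(\vec{x})}{p_{\vec{X}_{\mathcal{J}}}(\vec{x}_{\mathcal{J}})} \tag{3.20}$$

where $\mathcal{J} = \{j_1, j_2, \ldots, j_{n-m}\} := \{1, 2, \ldots, n\} / \mathcal{I}$.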
          $p_L$     $p_{L|R}(\cdot|0)$   $p_{L|R}(\cdot|1)$
L = 0     15/20     7/8                  1/4
L = 1     5/20      1/8                  3/4

Table 3.1: Joint, marginal and conditional pmfs of the random variables L and R defined in Example 3.1.5.


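The conditional pmfs $p_{Y|X}(\cdot|x)$ and $p_{\vec{X}_{\mathcal{I}} | \vec{X}_{\mathcal{J}}}(\cdot|\vec{x}_{\mathcal{J}})$ are valid pmfs in the probability space where
$X = x$ or $\vec{X}_{\mathcal{J}} = \vec{x}_{\mathcal{J}}$ respectively. For instance, they must be nonnegative and add up to one.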

From the definition of conditional pmfs we derive a chain rule for discrete random variables and 
vectors. 


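Lemma 3.1.4 (Chain rule for discrete random variables and vectors).

$$p_{X,Y}(x, y) = p_X(x) \, p_{Y|X}(y|x), \tag{3.21}$$

\begin{align}
p_{\vec{X}}(\vec{x}) &= p_{X_1}(x_1) \, p_{X_2|X_1}(x_2|x_1) \cdots p_{X_n|X_1,\ldots,X_{n-1}}(x_n|x_1, \ldots, x_{n-1}) \tag{3.22} \\
&= \prod_{i=1}^{n} p_{X_i | \vec{X}_{\{1,\ldots,i-1\}}}\left(x_i \mid \vec{x}_{\{1,\ldots,i-1\}}\right), \tag{3.23}
\end{align}

where the order of indices in the random vector is arbitrary (any order works).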


The following example illustrates the definitions of marginal and conditional pmfs. 

Example 3.1.5 (Flights and rains (continued)). Within the probability space described in
Example 1.2.1 we define a random variable

$$L = \begin{cases} 1 & \text{if plane is late}, \\ 0 & \text{otherwise}, \end{cases} \tag{3.24}$$

to represent whether the plane is late or not. Similarly,

$$R = \begin{cases} 1 & \text{if it rains}, \\ 0 & \text{otherwise}, \end{cases} \tag{3.25}$$

represents whether it rains or not. Equivalently, these random variables are just the indicators
$R = 1_{\text{rain}}$ and $L = 1_{\text{late}}$. Table 3.1 shows the joint, marginal and conditional pmfs of L and R.

△
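The marginalization and conditioning steps of this section can be sketched in a few lines of Python. The four joint pmf values below are an assumption: they are not listed in this section, and were chosen so that the resulting marginal and conditional pmfs agree with the entries of Table 3.1. The helper name `marginal` is purely illustrative.

```python
from fractions import Fraction

# Joint pmf p_{L,R}(l, r). ASSUMPTION: these values are not given in this
# section; they are chosen to agree with the entries of Table 3.1.
joint = {
    (0, 0): Fraction(14, 20),  # on time, no rain
    (0, 1): Fraction(1, 20),   # on time, rain
    (1, 0): Fraction(2, 20),   # late, no rain
    (1, 1): Fraction(3, 20),   # late, rain
}


def marginal(joint_pmf, axis):
    """Marginalize: sum the joint pmf over the variable that is not `axis`."""
    out = {}
    for outcome, prob in joint_pmf.items():
        out[outcome[axis]] = out.get(outcome[axis], 0) + prob
    return out


p_L = marginal(joint, 0)
p_R = marginal(joint, 1)

# Conditional pmf p_{L|R}(l | r) = p_{L,R}(l, r) / p_R(r), as in (3.18).
p_L_given_R = {(l, r): joint[(l, r)] / p_R[r] for (l, r) in joint}

# For a fixed r, the conditional pmf sums to one over l ...
sums_over_l = [sum(p_L_given_R[(l, r)] for l in (0, 1)) for r in (0, 1)]
# ... but for a fixed l it need not sum to one over r.
sums_over_r = [sum(p_L_given_R[(l, r)] for r in (0, 1)) for l in (0, 1)]

print(p_L, p_L_given_R)
print(sums_over_l, sums_over_r)
```

Exact fractions are used instead of floats so that the results can be compared directly with the table entries.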

3.2 Continuous random variables 

Continuous random variables allow us to model continuous quantities without having to worry 
about discretization. In exchange, the mathematical tools to manipulate them are somewhat 
more complicated than in the discrete case. 

3.2.1 Joint cdf and joint pdf 

As in the case of univariate continuous random variables, we characterize the behavior of several
continuous random variables defined on the same probability space through the probability
that they belong to Borel sets (or equivalently unions of intervals). In this case we are considering
multidimensional Borel sets, which are Cartesian products of one-dimensional Borel
sets. Multidimensional Borel sets can be represented as unions of multidimensional intervals
or hyperrectangles (defined as Cartesian products of one-dimensional intervals). The joint cdf
compiles the probability that the random variables belong to the Cartesian product of intervals
of the form $(-\infty, r]$ for every $r \in \mathbb{R}$.

Definition 3.2.1 (Joint cumulative distribution function). Let $(\Omega, \mathcal{F}, P)$ be a probability space
and $X, Y : \Omega \to \mathbb{R}$ random variables. The joint cdf of X and Y is defined as

$$F_{X,Y}(x, y) := P(X \le x, Y \le y). \tag{3.26}$$

In words, $F_{X,Y}(x, y)$ is the probability of X and Y being smaller than x and y respectively.

Let $\vec{X} : \Omega \to \mathbb{R}^n$ be a random vector of dimension n on a probability space $(\Omega, \mathcal{F}, P)$. The joint
cdf of $\vec{X}$ is defined as

$$F_{\vec{X}}(\vec{x}) := P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n). \tag{3.27}$$

In words, $F_{\vec{X}}(\vec{x})$ is the probability that $X_i \le x_i$ for all $i = 1, 2, \ldots, n$.

We now record some properties of the joint cdf. 

Lemma 3.2.2 (Properties of the joint cdf).

$$\lim_{x \to -\infty} F_{X,Y}(x, y) = 0, \tag{3.28}$$

$$\lim_{y \to -\infty} F_{X,Y}(x, y) = 0, \tag{3.29}$$

$$\lim_{x \to \infty, \, y \to \infty} F_{X,Y}(x, y) = 1, \tag{3.30}$$

$$F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2) \quad \text{if } x_2 \ge x_1, \; y_2 \ge y_1, \text{ i.e. } F_{X,Y} \text{ is nondecreasing.} \tag{3.31}$$

Proof. The proof follows along the same lines as that of Lemma 2.3.2. □
The joint cdf completely specifies the behavior of the corresponding random variables. Indeed,
we can decompose any Borel set into a union of disjoint n-dimensional intervals and compute
their probability by evaluating the joint cdf. Let us illustrate this for the bivariate case:

\begin{align}
P(x_1 < X \le x_2, \, y_1 < Y \le y_2) &= P(\{X \le x_2, Y \le y_2\} \cap \{X > x_1\} \cap \{Y > y_1\}) \tag{3.32} \\
&= P(X \le x_2, Y \le y_2) - P(X \le x_1, Y \le y_2) \tag{3.33} \\
&\quad - P(X \le x_2, Y \le y_1) + P(X \le x_1, Y \le y_1) \tag{3.34} \\
&= F_{X,Y}(x_2, y_2) - F_{X,Y}(x_1, y_2) - F_{X,Y}(x_2, y_1) + F_{X,Y}(x_1, y_1). \notag
\end{align}


This means that, as in the univariate case, to define a random vector or a group of random 
variables all we need to do is define their joint cdf. We don’t have to worry about the underlying 
probability space. 

If the joint cdf is differentiable, we can differentiate it to obtain the joint probability density 
function of X and Y. As in the case of univariate random variables, this is often a more 
convenient way of specifying the joint distribution. 


Definition 3.2.3 (Joint probability density function). If the joint cdf of two random variables
X, Y is differentiable, then their joint pdf is defined as

$$f_{X,Y}(x, y) := \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y}. \tag{3.35}$$

If the joint cdf of a random vector $\vec{X}$ is differentiable, then its joint pdf is defined as

$$f_{\vec{X}}(\vec{x}) := \frac{\partial^n F_{\vec{X}}(\vec{x})}{\partial x_1 \, \partial x_2 \cdots \partial x_n}. \tag{3.36}$$



The joint pdf should be understood as an n-dimensional density, not as a probability (for
instance, it can be larger than one). In the two-dimensional case,

$$\lim_{\Delta_x \to 0, \, \Delta_y \to 0} P(x \le X \le x + \Delta_x, \, y \le Y \le y + \Delta_y) = f_{X,Y}(x, y) \, \Delta_x \Delta_y. \tag{3.37}$$

Due to the monotonicity of joint cdfs in every variable, joint pdfs are always nonnegative.

The joint pdf of X and Y allows us to compute the probability of any Borel set $S \subseteq \mathbb{R}^2$ by
integrating over S

$$P((X, Y) \in S) = \int_{(x,y) \in S} f_{X,Y}(x, y) \, dx \, dy. \tag{3.38}$$

Similarly, the joint pdf of an n-dimensional random vector $\vec{X}$ allows us to compute the probability
that $\vec{X}$ belongs to a Borel set $S \subseteq \mathbb{R}^n$,

$$P(\vec{X} \in S) = \int_{\vec{x} \in S} f_{\vec{X}}(\vec{x}) \, d\vec{x}. \tag{3.39}$$

In particular, if we integrate a joint pdf over the whole space $\mathbb{R}^n$, then it must integrate to one
by the Law of Total Probability.
Figure 3.1: Triangle lake in Example 3.2.12.


Example 3.2.4 (Triangle lake). A biologist is tracking an otter that lives in a lake. She decides 
to model the location of the otter probabilistically. The lake happens to be triangular as shown 
in Figure 3.1, so that we can represent it by the set 

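$$\text{Lake} := \{\vec{x} \mid x_1 \ge 0, \; x_2 \ge 0, \; x_1 + x_2 \le 1\}. \tag{3.40}$$

The biologist has no idea where the otter is, so she models the position as a random vector $\vec{X}$
which is uniformly distributed over the lake. In other words, the joint pdf of $\vec{X}$ is constant,

$$f_{\vec{X}}(\vec{x}) = \begin{cases} c & \text{if } \vec{x} \in \text{Lake}, \\ 0 & \text{otherwise}. \end{cases} \tag{3.41}$$

To find the normalizing constant c we use the fact that a valid joint pdf should integrate
to 1.

\begin{align}
\int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{\infty} f_{\vec{X}}(\vec{x}) \, dx_1 \, dx_2 &= \int_{x_2=0}^{1} \int_{x_1=0}^{1-x_2} c \, dx_1 \, dx_2 \tag{3.42} \\
&= c \int_{x_2=0}^{1} (1 - x_2) \, dx_2 \tag{3.43} \\
&= \frac{c}{2} = 1, \tag{3.44}
\end{align}

so c = 2.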

We now compute the cdf of $\vec{X}$. $F_{\vec{X}}(\vec{x})$ represents the probability that the otter is southwest of
the point $\vec{x}$. Computing the joint cdf requires dividing the range into the sets shown in Figure 3.1
and integrating the joint pdf. If $\vec{x} \in A$ then $F_{\vec{X}}(\vec{x}) = 0$ because $P(\vec{X} \le \vec{x}) = 0$. If $\vec{x} \in B$,

$$F_{\vec{X}}(\vec{x}) = \int_{u=0}^{x_2} \int_{v=0}^{x_1} 2 \, dv \, du = 2 x_1 x_2. \tag{3.45}$$

If $\vec{x} \in C$,

$$F_{\vec{X}}(\vec{x}) = \int_{u=0}^{1-x_1} \int_{v=0}^{x_1} 2 \, dv \, du + \int_{u=1-x_1}^{x_2} \int_{v=0}^{1-u} 2 \, dv \, du = 2x_1 + 2x_2 - x_2^2 - x_1^2 - 1. \tag{3.46}$$

If $\vec{x} \in D$,

$$F_{\vec{X}}(\vec{x}) = P(X_1 \le x_1, X_2 \le x_2) = P(X_1 \le 1, X_2 \le x_2) = F_{\vec{X}}(1, x_2) = 2x_2 - x_2^2, \tag{3.47}$$

where the last step follows from (3.46). Exchanging $x_1$ and $x_2$, we obtain $F_{\vec{X}}(\vec{x}) = 2x_1 - x_1^2$ for
$\vec{x} \in E$ by the same reasoning. Finally, for $\vec{x} \in F$, $F_{\vec{X}}(\vec{x}) = 1$ because $P(X_1 \le x_1, X_2 \le x_2) = 1$.
Putting everything together,

$$F_{\vec{X}}(\vec{x}) = \begin{cases}
0 & \text{if } x_1 < 0 \text{ or } x_2 < 0, \\
2 x_1 x_2 & \text{if } x_1 \ge 0, \; x_2 \ge 0, \; x_1 + x_2 \le 1, \\
2x_1 + 2x_2 - x_2^2 - x_1^2 - 1 & \text{if } x_1 \le 1, \; x_2 \le 1, \; x_1 + x_2 \ge 1, \\
2x_2 - x_2^2 & \text{if } x_1 \ge 1, \; 0 \le x_2 \le 1, \\
2x_1 - x_1^2 & \text{if } 0 \le x_1 \le 1, \; x_2 \ge 1, \\
1 & \text{if } x_1 \ge 1, \; x_2 \ge 1.
\end{cases} \tag{3.48}$$

△
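As a sanity check on (3.48), the following sketch evaluates the joint cdf of the triangle lake and uses the inclusion-exclusion identity for rectangles stated earlier in this section. The function names `lake_cdf` and `rect_prob` are illustrative, and clamping the arguments at 1 is a compact way (an implementation choice, not part of the text) of covering the regions D, E and F with the formulas for B and C.

```python
def lake_cdf(x1, x2):
    """Joint cdf of a point uniformly distributed on the triangle lake, as in (3.48)."""
    if x1 < 0 or x2 < 0:
        return 0.0
    # Beyond 1 the cdf stops growing in that coordinate, so clamp the arguments.
    a, b = min(x1, 1.0), min(x2, 1.0)
    if a + b <= 1:
        return 2 * a * b
    return 2 * a + 2 * b - a ** 2 - b ** 2 - 1


def rect_prob(x1_lo, x1_hi, x2_lo, x2_hi):
    """P(x1_lo < X1 <= x1_hi, x2_lo < X2 <= x2_hi) by inclusion-exclusion on the cdf."""
    return (lake_cdf(x1_hi, x2_hi) - lake_cdf(x1_lo, x2_hi)
            - lake_cdf(x1_hi, x2_lo) + lake_cdf(x1_lo, x2_lo))


# A rectangle fully inside the lake has probability 2 * area (the pdf equals 2 there).
inside = rect_prob(0.1, 0.3, 0.2, 0.4)
# A rectangle that only touches the lake at the point (0.5, 0.5) has probability 0.
outside = rect_prob(0.5, 1.0, 0.5, 1.0)
print(inside, outside)
```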


3.2.2 Marginalization 

We now discuss how to characterize the marginal distributions of individual random variables
from a joint cdf or a joint pdf. Consider the joint cdf $F_{X,Y}(x, y)$. When $x \to \infty$ the limit
of $F_{X,Y}(x, y)$ is by definition the probability of Y being smaller than y, which is precisely the
marginal cdf of Y. More formally,

\begin{align}
\lim_{x \to \infty} F_{X,Y}(x, y) &= \lim_{n \to \infty} P\left(\cup_{i=1}^{n} \{X \le i, Y \le y\}\right) \tag{3.49} \\
&= P\left(\lim_{n \to \infty} \{X \le n, Y \le y\}\right) \tag{3.50} \\
&= P(Y \le y) \tag{3.51} \\
&= F_Y(y). \tag{3.52}
\end{align}

If the random variables have a joint pdf, we can also compute the marginal cdf by integrating
over x

\begin{align}
F_Y(y) &= P(Y \le y) \tag{3.53} \\
&= \int_{u=-\infty}^{y} \int_{x=-\infty}^{\infty} f_{X,Y}(x, u) \, dx \, du. \tag{3.54}
\end{align}

Differentiating the latter equation with respect to y, we obtain the marginal pdf of Y

$$f_Y(y) = \int_{x=-\infty}^{\infty} f_{X,Y}(x, y) \, dx. \tag{3.55}$$
Similarly, the marginal pdf of a subvector $\vec{X}_{\mathcal{I}}$ of a random vector $\vec{X}$ indexed by $\mathcal{I} := \{i_1, i_2, \ldots, i_m\}$
is obtained by integrating over the rest of the components $\{j_1, j_2, \ldots, j_{n-m}\} := \{1, 2, \ldots, n\} / \mathcal{I}$,

$$f_{\vec{X}_{\mathcal{I}}}(\vec{x}_{\mathcal{I}}) = \int_{x_{j_1}} \int_{x_{j_2}} \cdots \int_{x_{j_{n-m}}} f_{\vec{X}}(\vec{x}) \, dx_{j_1} \, dx_{j_2} \cdots dx_{j_{n-m}}. \tag{3.56}$$



Example 3.2.5 (Triangle lake (continued)). The biologist is interested in the probability that
the otter is south of $x_1$. This information is encoded in the cdf of the random vector; we just
need to take the limit when $x_2 \to \infty$ to marginalize over $x_2$.

$$F_{X_1}(x_1) = \begin{cases} 0 & \text{if } x_1 < 0, \\ 2x_1 - x_1^2 & \text{if } 0 \le x_1 \le 1, \\ 1 & \text{if } x_1 \ge 1. \end{cases} \tag{3.57}$$

To obtain the marginal pdf of $X_1$, which represents the latitude of the otter's position, we
differentiate the marginal cdf

$$f_{X_1}(x_1) = \frac{d F_{X_1}(x_1)}{d x_1} = \begin{cases} 2 (1 - x_1) & \text{if } 0 \le x_1 \le 1, \\ 0 & \text{otherwise}. \end{cases} \tag{3.58}$$

Alternatively, we could have integrated the joint uniform pdf over $x_2$ (we encourage you to check
that the result is the same).

△
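The suggested check can also be carried out numerically. The sketch below approximates the integral of the joint pdf over $x_2$ with a midpoint rule and compares it with $2(1 - x_1)$ from (3.58). The function names and the number of grid points are arbitrary choices made for illustration.

```python
def lake_pdf(x1, x2):
    """Joint pdf of the triangle lake example: constant 2 on the lake, 0 elsewhere."""
    inside = x1 >= 0 and x2 >= 0 and x1 + x2 <= 1
    return 2.0 if inside else 0.0


def marginal_pdf_x1(x1, n=20_000):
    """Midpoint-rule approximation of the integral of the joint pdf over x2 in [0, 1]."""
    h = 1.0 / n
    return sum(lake_pdf(x1, (k + 0.5) * h) for k in range(n)) * h


for x1 in (0.0, 0.25, 0.75):
    # Numerical marginal next to the closed form 2 (1 - x1).
    print(x1, marginal_pdf_x1(x1), 2 * (1 - x1))
```

Integrating over [0, 1] is enough because the joint pdf vanishes for $x_2$ outside that interval.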


3.2.3 Conditional distributions 

In this section we discuss how to obtain the conditional distribution of a random variable given
information about other random variables in the probability space. To begin with, we consider
the case of two random variables. As in the case of univariate distributions, we can define the
joint cdf and pdf of two random variables given events of the form $\{(X, Y) \in S\}$ for any Borel
set S in $\mathbb{R}^2$ by applying the definition of conditional probability.

Definition 3.2.6 (Joint conditional cdf and pdf given an event). Let X, Y be random variables
with joint pdf $f_{X,Y}$ and let $S \subseteq \mathbb{R}^2$ be any Borel set with nonzero probability. The conditional cdf
and pdf of X and Y given the event $\{(X, Y) \in S\}$ are defined as

\begin{align}
F_{X,Y|(X,Y) \in S}(x, y) &:= P(X \le x, Y \le y \mid (X, Y) \in S) \tag{3.59} \\
&= \frac{P(X \le x, Y \le y, (X, Y) \in S)}{P((X, Y) \in S)} \tag{3.60} \\
&= \frac{\int_{u \le x, \, v \le y, \, (u,v) \in S} f_{X,Y}(u, v) \, du \, dv}{\int_{(u,v) \in S} f_{X,Y}(u, v) \, du \, dv}, \tag{3.61} \\
f_{X,Y|(X,Y) \in S}(x, y) &:= \frac{\partial^2 F_{X,Y|(X,Y) \in S}(x, y)}{\partial x \, \partial y}. \tag{3.62}
\end{align}



This definition only holds for events with nonzero probability. However, events of the form
$\{X = x\}$ have probability equal to zero because the random variable is continuous. Indeed, the
range of X is uncountable, so the probability of almost every event $\{X = x\}$ must be zero, as
otherwise the probability of their union would be unbounded.

How can we characterize our uncertainty about Y given X = x then? We define a conditional 
pdf that captures what we are trying to do in the limit and then integrate it to obtain a 
conditional cdf. 

Definition 3.2.7 (Conditional pdf and cdf). If $F_{X,Y}$ is differentiable, then the conditional pdf
of Y given X is defined as

$$f_{Y|X}(y|x) := \frac{f_{X,Y}(x, y)}{f_X(x)} \quad \text{if } f_X(x) > 0 \tag{3.63}$$

and is undefined otherwise.

The conditional cdf of Y given X is defined as

$$F_{Y|X}(y|x) := \int_{u=-\infty}^{y} f_{Y|X}(u|x) \, du \quad \text{if } f_X(x) > 0 \tag{3.64}$$

and is undefined otherwise.


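We now justify this definition, beyond the analogy with (3.18). Assume that $f_X(x) > 0$. Let us
write the definition of the conditional pdf in terms of limits. We have

\begin{align}
f_X(x) &= \lim_{\Delta_x \to 0} \frac{P(x \le X \le x + \Delta_x)}{\Delta_x}, \tag{3.65} \\
f_{X,Y}(x, y) &= \lim_{\Delta_x \to 0} \frac{1}{\Delta_x} \frac{\partial P(x \le X \le x + \Delta_x, Y \le y)}{\partial y}. \tag{3.66}
\end{align}

This implies

$$\frac{f_{X,Y}(x, y)}{f_X(x)} = \lim_{\Delta_x \to 0} \frac{1}{P(x \le X \le x + \Delta_x)} \frac{\partial P(x \le X \le x + \Delta_x, Y \le y)}{\partial y}. \tag{3.67}$$

We can now write the conditional cdf as

\begin{align}
F_{Y|X}(y|x) &= \int_{u=-\infty}^{y} \lim_{\Delta_x \to 0} \frac{1}{P(x \le X \le x + \Delta_x)} \frac{\partial P(x \le X \le x + \Delta_x, Y \le u)}{\partial u} \, du \tag{3.68} \\
&= \lim_{\Delta_x \to 0} \frac{1}{P(x \le X \le x + \Delta_x)} \int_{u=-\infty}^{y} \frac{\partial P(x \le X \le x + \Delta_x, Y \le u)}{\partial u} \, du \tag{3.69} \\
&= \lim_{\Delta_x \to 0} \frac{P(x \le X \le x + \Delta_x, Y \le y)}{P(x \le X \le x + \Delta_x)} \tag{3.70} \\
&= \lim_{\Delta_x \to 0} P(Y \le y \mid x \le X \le x + \Delta_x). \tag{3.71}
\end{align}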

We can therefore interpret the conditional cdf as the limit of the cdf of Y at y conditioned on 
X belonging to an interval around x when the width of the interval tends to zero. 

Remark 3.2.8. Interchanging limits and integrals as in (3.69) is not necessarily justified in 
general. In this case it is, as long as the integral converges and the quantities involved are 
bounded. 
An immediate consequence of Definition 3.2.7 is the chain rule for continuous random variables.


Lemma 3.2.9 (Chain rule for continuous random variables).

$$f_{X,Y}(x, y) = f_X(x) \, f_{Y|X}(y|x). \tag{3.72}$$


Applying the same ideas as in the bivariate case, we define the conditional distribution of a 
subvector given the rest of the random vector. 


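Definition 3.2.10 (Conditional pdf). The conditional pdf of a random subvector $\vec{X}_{\mathcal{I}}$, $\mathcal{I} \subseteq
\{1, 2, \ldots, n\}$, given the subvector $\vec{X}_{\{1,\ldots,n\}/\mathcal{I}}$ is

$$f_{\vec{X}_{\mathcal{I}} | \vec{X}_{\{1,\ldots,n\}/\mathcal{I}}}\left(\vec{x}_{\mathcal{I}} \mid \vec{x}_{\{1,\ldots,n\}/\mathcal{I}}\right) := \frac{f_{\vec{X}}(\vec{x})}{f_{\vec{X}_{\{1,\ldots,n\}/\mathcal{I}}}\left(\vec{x}_{\{1,\ldots,n\}/\mathcal{I}}\right)}. \tag{3.73}$$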


It is often useful to represent the joint pdf of a random vector by factoring it into conditional 
pdfs using the chain rule for random vectors. 

Lemma 3.2.11 (Chain rule for random vectors). The joint pdf of a random vector $\vec{X}$ can be
decomposed into

\begin{align}
f_{\vec{X}}(\vec{x}) &= f_{X_1}(x_1) \, f_{X_2|X_1}(x_2|x_1) \cdots f_{X_n|X_1,\ldots,X_{n-1}}(x_n|x_1, \ldots, x_{n-1}) \tag{3.74} \\
&= \prod_{i=1}^{n} f_{X_i | \vec{X}_{\{1,\ldots,i-1\}}}\left(x_i \mid \vec{x}_{\{1,\ldots,i-1\}}\right). \tag{3.75}
\end{align}

Note that the order is arbitrary: you can reorder the components of the vector in any way you
like.


Proof. The result follows from applying the definition of conditional pdf recursively. □


Example 3.2.12 (Triangle lake (continued)). The biologist spots the otter from the shore of
the lake. She is standing on the west side of the lake at a latitude of $x_1 = 0.75$ looking east and
the otter is right in front of her. The otter is consequently also at a latitude of $x_1 = 0.75$, but
she cannot tell at what distance. The distribution of the location of the otter given its latitude
$X_1$ is characterized by the conditional pdf of the longitude $X_2$ given $X_1$,

\begin{align}
f_{X_2|X_1}(x_2|x_1) &= \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_1}(x_1)} \tag{3.76} \\
&= \frac{1}{1 - x_1}, \quad 0 \le x_2 \le 1 - x_1. \tag{3.77}
\end{align}

The biologist is interested in the probability that the otter is closer than $x_2$ to her. This
probability is given by the conditional cdf

\begin{align}
F_{X_2|X_1}(x_2|x_1) &= \int_{u=-\infty}^{x_2} f_{X_2|X_1}(u|x_1) \, du \tag{3.78} \\
&= \frac{x_2}{1 - x_1}. \tag{3.79}
\end{align}

The probability that the otter is less than $x_2$ away is $4 x_2$ for $0 \le x_2 \le 1/4$.

△
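The interpretation of the conditional cdf as a limit, (3.71), can be illustrated on this example. The sketch below conditions on the latitude lying in a shrinking interval $(x_1, x_1 + \Delta]$, computing all probabilities from the joint cdf (3.48), and compares the result with $x_2 / (1 - x_1)$ from (3.79). Function names and the chosen values of $\Delta$ are illustrative assumptions.

```python
def lake_cdf(x1, x2):
    """Joint cdf (3.48) of the triangle lake example."""
    if x1 < 0 or x2 < 0:
        return 0.0
    a, b = min(x1, 1.0), min(x2, 1.0)  # the cdf is flat beyond 1 in each coordinate
    if a + b <= 1:
        return 2 * a * b
    return 2 * a + 2 * b - a ** 2 - b ** 2 - 1


def cond_prob(x2, x1, delta):
    """P(X2 <= x2 | x1 < X1 <= x1 + delta), computed from the joint cdf."""
    numerator = lake_cdf(x1 + delta, x2) - lake_cdf(x1, x2)
    # Marginal cdf of X1: let x2 grow (x2 = 1 already gives the limit).
    denominator = lake_cdf(x1 + delta, 1.0) - lake_cdf(x1, 1.0)
    return numerator / denominator


x1, x2 = 0.75, 0.1
for delta in (0.1, 0.01, 0.001, 0.0001):
    print(delta, cond_prob(x2, x1, delta))
print("limit predicted by (3.79):", x2 / (1 - x1))
```

As the interval shrinks the ratio approaches $4 x_2$, in agreement with the closing statement of the example.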
Figure 3.2: Joint pdf of a bivariate Gaussian random variable (X, Y) together with the marginal pdfs
of X and Y.

3.2.4 Gaussian random vectors 

Gaussian random vectors are a multidimensional generalization of Gaussian random variables. 
They are parametrized by a vector and a matrix that correspond to their mean and covariance 
matrix (we define these quantities for general multivariate random variables in Chapter 4). 

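Definition 3.2.13 (Gaussian random vector). A Gaussian random vector $\vec{X}$ is a random vector
with joint pdf

$$f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right) \tag{3.80}$$

where the mean vector $\vec{\mu} \in \mathbb{R}^n$ and the covariance matrix $\Sigma$, which is symmetric and positive
definite, parametrize the distribution. A Gaussian distribution with mean $\vec{\mu}$ and covariance
matrix $\Sigma$ is usually denoted by $\mathcal{N}(\vec{\mu}, \Sigma)$.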

A fundamental property of Gaussian random vectors is that performing linear transformations 
on them always yields vectors with joint distributions that are also Gaussian. We will not prove 
this result formally, but the proof is similar to Lemma 2.5.1 (in fact this is a multidimensional 
generalization of that result). 

Theorem 3.2.14 (Linear transformations of Gaussian random vectors are Gaussian). Let $\vec{X}$
be a Gaussian random vector of dimension n with mean $\vec{\mu}$ and covariance matrix $\Sigma$. For any
matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$, $\vec{Y} = A\vec{X} + \vec{b}$ is a Gaussian random vector with mean $A\vec{\mu} + \vec{b}$ and
covariance matrix $A \Sigma A^T$.
A corollary of this result is that any subvector of a Gaussian random vector is also Gaussian.


Corollary 3.2.15 (Marginals of Gaussian random vectors are Gaussian). The joint pdf of any
subvector of a Gaussian random vector is Gaussian. Without loss of generality, assume that the
subvector $\vec{X}$ consists of the first m entries of the Gaussian random vector,

$$\vec{Z} := \begin{bmatrix} \vec{X} \\ \vec{Y} \end{bmatrix}, \quad \text{with mean } \vec{\mu} := \begin{bmatrix} \vec{\mu}_X \\ \vec{\mu}_Y \end{bmatrix} \tag{3.81}$$

and covariance matrix

$$\Sigma_Z = \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^T & \Sigma_Y \end{bmatrix}. \tag{3.82}$$

Then $\vec{X}$ is a Gaussian random vector with mean $\vec{\mu}_X$ and covariance matrix $\Sigma_X$.


Proof. Note that

$$\vec{X} = \begin{bmatrix} I_m & 0_{m \times (n-m)} \end{bmatrix} \begin{bmatrix} \vec{X} \\ \vec{Y} \end{bmatrix} = \begin{bmatrix} I_m & 0_{m \times (n-m)} \end{bmatrix} \vec{Z}, \tag{3.83}$$

where $I_m \in \mathbb{R}^{m \times m}$ is an identity matrix and $0_{c \times d}$ represents a matrix of zeros of dimensions $c \times d$.
The result then follows from Theorem 3.2.14. □


Figure 3.2 shows the joint pdf of a bivariate Gaussian random variable along with its marginal 
pdfs. 
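To make Theorem 3.2.14 and Corollary 3.2.15 concrete, the sketch below computes the mean $A\vec{\mu} + \vec{b}$ and covariance $A \Sigma A^T$ of a linearly transformed Gaussian vector, and then picks the matrix that selects the first two entries of a three-dimensional vector. The numerical values of the mean and covariance are made up for illustration, and the helper functions are written in plain Python to keep the example self-contained.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def transform_gaussian(A, b, mu, Sigma):
    """Mean and covariance of Y = A X + b when X is Gaussian with parameters (mu, Sigma)."""
    mean = [sum(a * m for a, m in zip(row, mu)) + b_i for row, b_i in zip(A, b)]
    cov = matmul(matmul(A, Sigma), transpose(A))
    return mean, cov


# Hypothetical parameters of a 3-dimensional Gaussian vector.
mu = [1.0, -2.0, 0.5]
Sigma = [[2.0, 0.3, 0.1],
         [0.3, 1.0, 0.2],
         [0.1, 0.2, 0.5]]

# Selecting the first m = 2 entries corresponds to A = [I_2  0] and b = 0.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
b = [0.0, 0.0]

mean, cov = transform_gaussian(A, b, mu, Sigma)
print(mean)  # first two entries of mu
print(cov)   # upper-left 2 x 2 block of Sigma
```

The output reproduces the statement of the corollary: the marginal parameters are read off from the corresponding entries of the mean and the corresponding block of the covariance matrix.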


3.3 Joint distributions of discrete and continuous variables 


Probabilistic models often include both discrete and continuous random variables. However, the 
joint pmf or pdf of a discrete and a continuous random variable is not well defined. In order to 
specify the joint distribution in such cases we use their marginal and conditional pmfs and pdfs. 

Assume that we have a continuous random variable C and a discrete random variable D with
range $R_D$. We define the conditional cdf and pdf of C given D as follows.

Definition 3.3.1 (Conditional cdf and pdf of a continuous random variable given a discrete
random variable). Let C and D be a continuous and a discrete random variable defined on the
same probability space. Then, the conditional cdf and pdf of C given D are of the form

\begin{align}
F_{C|D}(c|d) &:= P(C \le c \mid d), \tag{3.84} \\
f_{C|D}(c|d) &:= \frac{d F_{C|D}(c|d)}{d c}. \tag{3.85}
\end{align}


We obtain the marginal cdf and pdf of C from the conditional cdfs and pdfs by computing a 
weighted sum. 
Figure 3.3: Conditional and marginal distributions of the weight of the bears W in Example 3.3.3.

Lemma 3.3.2. Let $F_{C|D}$ and $f_{C|D}$ be the conditional cdf and pdf of a continuous random variable
C given a discrete random variable D. Then,

\begin{align}
F_C(c) &= \sum_{d \in R_D} p_D(d) \, F_{C|D}(c|d), \tag{3.86} \\
f_C(c) &= \sum_{d \in R_D} p_D(d) \, f_{C|D}(c|d). \tag{3.87}
\end{align}

Proof. The events $\{D = d\}$ are a partition of the whole probability space (one of them must
happen and they are all disjoint), so

\begin{align}
F_C(c) &= P(C \le c) \tag{3.88} \\
&= \sum_{d \in R_D} P(D = d) \, P(C \le c \mid d) \quad \text{by the Law of Total Probability} \tag{3.89} \\
&= \sum_{d \in R_D} p_D(d) \, F_{C|D}(c|d). \tag{3.90}
\end{align}

Now, (3.87) follows by differentiating. □


Combining a discrete marginal pmf with a continuous conditional distribution allows us to define 
mixture models where the data is drawn from a continuous distribution whose parameters are 
chosen from a discrete set. If a Gaussian is used as the continuous distribution, this yields a 
Gaussian mixture model. Fitting Gaussian mixture models is a popular technique for clustering 
data. 

Example 3.3.3 (Grizzlies in Yellowstone). A scientist is gathering data on the bears in Yellowstone.
It turns out that the weight of the males is well modeled by a Gaussian random
variable with mean 240 kg and standard deviation 40 kg, whereas the weight of the females is
well modeled by a Gaussian with mean 140 kg and standard deviation 20 kg. There are about
the same number of females and males.

The distribution of the weights of all the grizzlies can consequently be modeled by a Gaussian
mixture that includes a continuous random variable W to represent the weight and a discrete
random variable S to represent the sex of the bears. S is Bernoulli with parameter 1/2, W given
S = 0 (male) is $\mathcal{N}(240, 1600)$ and W given S = 1 (female) is $\mathcal{N}(140, 400)$. By (3.87) the pdf of
W is consequently of the form

\begin{align}
f_W(w) &= \sum_{s=0}^{1} p_S(s) \, f_{W|S}(w|s) \tag{3.91} \\
&= \frac{1}{2\sqrt{2\pi}} \left( \frac{e^{-\frac{(w-240)^2}{3200}}}{40} + \frac{e^{-\frac{(w-140)^2}{800}}}{20} \right). \tag{3.92}
\end{align}

Figure 3.3 shows the conditional and marginal distributions of W.

△
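The mixture pdf (3.92) is easy to evaluate numerically. The following sketch builds it from the two conditional Gaussian pdfs and checks with a simple Riemann sum that it integrates to approximately one. The function names, the integration range (0 to 600 kg) and the step size are choices made here for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Pdf of a Gaussian random variable with the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)


def weight_pdf(w):
    """Marginal pdf of W: equal-weight mixture of N(240, 40^2) and N(140, 20^2)."""
    return 0.5 * gaussian_pdf(w, 240, 40) + 0.5 * gaussian_pdf(w, 140, 20)


# Riemann sum over a range that covers essentially all of the probability mass.
step = 0.1
total = sum(weight_pdf(k * step) for k in range(6000)) * step
print(total)            # should be very close to 1
print(weight_pdf(180))  # density at 180 kg
```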


Defining the conditional pmf of a discrete random variable D given a continuous random variable 
C is challenging because the probability of the event {C = c} is zero. We follow the same 
approach as in Definition 3.2.7 and define the conditional pmf as a limit. 


Definition 3.3.4 (Conditional pmf of a discrete random variable given a continuous random
variable). Let C and D be a continuous and a discrete random variable defined on the same
probability space. Then, the conditional pmf of D given C is defined as

$$p_{D|C}(d|c) := \lim_{\Delta \to 0} \frac{P(D = d, \, c \le C \le c + \Delta)}{P(c \le C \le c + \Delta)}. \tag{3.93}$$


Analogously to Lemma 3.3.2, we obtain the marginal pmf of D from the conditional pmfs by 
computing a weighted sum. 


Lemma 3.3.5. Let $p_{D|C}$ be the conditional pmf of a discrete random variable D given a continuous
random variable C. Then,

$$p_D(d) = \int_{c=-\infty}^{\infty} f_C(c) \, p_{D|C}(d|c) \, dc. \tag{3.94}$$


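Proof. We will not give a formal proof but rather an intuitive argument that can be made
rigorous. If we take a grid of values $\ldots, c_{-1}, c_0, c_1, \ldots$ for c with spacing $\Delta$, then

$$p_D(d) = \sum_{i=-\infty}^{\infty} P(D = d, \, c_i \le C \le c_i + \Delta) \tag{3.95}$$

by the Law of Total Probability. Taking the limit as $\Delta \to 0$ the sum becomes an integral and
we have

\begin{align}
p_D(d) &= \int_{c=-\infty}^{\infty} \lim_{\Delta \to 0} \frac{P(D = d, \, c \le C \le c + \Delta)}{\Delta} \, dc \tag{3.96} \\
&= \int_{c=-\infty}^{\infty} \lim_{\Delta \to 0} \frac{P(c \le C \le c + \Delta)}{\Delta} \cdot \frac{P(D = d, \, c \le C \le c + \Delta)}{P(c \le C \le c + \Delta)} \, dc \tag{3.97} \\
&= \int_{c=-\infty}^{\infty} f_C(c) \, p_{D|C}(d|c) \, dc, \tag{3.98}
\end{align}

since $f_C(c) = \lim_{\Delta \to 0} \frac{P(c \le C \le c + \Delta)}{\Delta}$. □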
Combining continuous marginal distributions with discrete conditional distributions is particularly
useful in Bayesian statistical models, as illustrated in the following example (see Chapter 10
for more information). The continuous distribution is used to quantify our uncertainty about
the parameter of a discrete distribution.

Example 3.3.6 (Bayesian coin flip). Your uncle bets you ten dollars that a coin flip will turn
out heads. You suspect that the coin is biased, but you are not sure to what extent. To model
this uncertainty you represent the bias as a continuous random variable B with the following
pdf:

$$f_B(b) = 2b \quad \text{for } b \in [0, 1]. \tag{3.99}$$

Let X denote the result of the coin flip, with X = 1 representing heads. You can now compute
the probability that the coin lands on heads using Lemma 3.3.5. Conditioned on the bias B, the
result of the coin flip is Bernoulli with parameter B.

\begin{align}
p_X(1) &= \int_{b=-\infty}^{\infty} f_B(b) \, p_{X|B}(1|b) \, db \tag{3.100} \\
&= \int_{b=0}^{1} 2b^2 \, db \tag{3.101} \\
&= \frac{2}{3}. \tag{3.102}
\end{align}

According to your model the probability that the coin lands heads is 2/3. △
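The integral in Lemma 3.3.5 can be approximated numerically for this example. The sketch below applies a midpoint rule to $f_B(b) \, p_{X|B}(1|b) = 2b \cdot b$ on $[0, 1]$. The function names and the number of grid points are illustrative choices.

```python
def bias_pdf(b):
    """Prior pdf of the bias, f_B(b) = 2b on [0, 1]."""
    return 2 * b if 0 <= b <= 1 else 0.0


def prob_heads(n=10_000):
    """Midpoint-rule approximation of the integral of f_B(b) * p_{X|B}(1 | b) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        b = (k + 0.5) * h
        total += bias_pdf(b) * b  # conditioned on B = b, heads has probability b
    return total * h


print(prob_heads())  # close to 2/3
```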


The following lemma provides an analogue to the chain rule for jointly distributed continuous 
and discrete random variables. 


Lemma 3.3.7 (Chain rule for jointly distributed continuous and discrete random variables). Let
C be a continuous random variable with conditional pdf $f_{C|D}$ and D a discrete random variable
with conditional pmf $p_{D|C}$. Then,

$$p_D(d) \, f_{C|D}(c|d) = f_C(c) \, p_{D|C}(d|c). \tag{3.103}$$

Proof. Applying the definitions,

\begin{align}
p_D(d) \, f_{C|D}(c|d) &= \lim_{\Delta \to 0} P(D = d) \, \frac{P(c \le C \le c + \Delta \mid D = d)}{\Delta} \tag{3.104} \\
&= \lim_{\Delta \to 0} \frac{P(D = d, \, c \le C \le c + \Delta)}{\Delta} \tag{3.105} \\
&= \lim_{\Delta \to 0} \frac{P(c \le C \le c + \Delta)}{\Delta} \cdot \frac{P(D = d, \, c \le C \le c + \Delta)}{P(c \le C \le c + \Delta)} \tag{3.106} \\
&= f_C(c) \, p_{D|C}(d|c). \tag{3.107}
\end{align}

□


Example 3.3.8 (Grizzlies in Yellowstone (continued)). The scientist observes a bear with her
binoculars. From its size she estimates that its weight is 180 kg. What is the probability that
the bear is male?









Figure 3.4: Conditional and marginal distributions of the bias of the coin flip in Example 3.3.9.


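We apply Lemma 3.3.7 to compute

\begin{align}
p_{S|W}(0|180) &= \frac{p_S(0) \, f_{W|S}(180|0)}{f_W(180)} \tag{3.108} \\
&= \frac{\frac{1}{40} \exp\left(-\frac{60^2}{3200}\right)}{\frac{1}{40} \exp\left(-\frac{60^2}{3200}\right) + \frac{1}{20} \exp\left(-\frac{40^2}{800}\right)} \tag{3.109} \\
&= 0.545. \tag{3.110}
\end{align}

According to the probabilistic model, the probability that it's a male is 0.545.

△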
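The computation in this example follows the pattern prior times likelihood divided by the marginal pdf, which can be sketched as follows. The dictionary-based layout and the names `prior`, `likelihood`, `evidence` and `posterior` are choices made here; the model parameters are those of Example 3.3.3.

```python
import math


def gaussian_pdf(x, mean, std):
    """Pdf of a Gaussian random variable with the given mean and standard deviation."""
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)


observed_weight = 180.0

# p_S(s): s = 0 is male, s = 1 is female.
prior = {0: 0.5, 1: 0.5}
# f_{W|S}(180 | s) under the model of Example 3.3.3.
likelihood = {
    0: gaussian_pdf(observed_weight, 240, 40),
    1: gaussian_pdf(observed_weight, 140, 20),
}

# f_W(180), obtained as a weighted sum as in (3.87).
evidence = sum(prior[s] * likelihood[s] for s in prior)
# p_{S|W}(s | 180), by Lemma 3.3.7.
posterior = {s: prior[s] * likelihood[s] / evidence for s in prior}

print(posterior)
```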


Example 3.3.9 (Bayesian coin flip (continued)). The coin lands on tails. You decide to recompute
the distribution of the bias conditioned on this information. By Lemma 3.3.7

\begin{align}
f_{B|X}(b|0) &= \frac{f_B(b) \, p_{X|B}(0|b)}{p_X(0)} \tag{3.111} \\
&= \frac{2b (1 - b)}{1/3} \tag{3.112} \\
&= 6b (1 - b). \tag{3.113}
\end{align}

Conditioned on the outcome, the pdf of the bias is now centered at 1/2 instead of concentrated
near one as before, as shown in Figure 3.4.

△
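The same update can be carried out numerically, without computing $p_X(0)$ by hand: multiply the prior pdf by the probability of tails, then normalize. The sketch below does this with a midpoint rule and compares the result with $6b(1 - b)$. Function names and grid size are illustrative assumptions.

```python
def prior_pdf(b):
    """Prior pdf of the bias, f_B(b) = 2b on [0, 1]."""
    return 2 * b if 0 <= b <= 1 else 0.0


def prob_tails_given_bias(b):
    """p_{X|B}(0 | b): conditioned on B = b, tails has probability 1 - b."""
    return 1 - b


n = 10_000
h = 1.0 / n
grid = [(k + 0.5) * h for k in range(n)]

# p_X(0), approximated with the midpoint rule (the exact value is 1/3).
evidence = sum(prior_pdf(b) * prob_tails_given_bias(b) for b in grid) * h


def posterior_pdf(b):
    """f_{B|X}(b | 0), obtained from Lemma 3.3.7."""
    return prior_pdf(b) * prob_tails_given_bias(b) / evidence


for b in (0.25, 0.5, 0.75):
    # Numerical posterior next to the closed form 6 b (1 - b).
    print(b, posterior_pdf(b), 6 * b * (1 - b))
```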


3.4 Independence 

In this section we define independence and conditional independence for random variables and 
vectors. 
3.4.1 Definition

When knowledge about a random variable X does not affect our uncertainty about another
random variable Y, we say that X and Y are independent. Formally, this is reflected by the
marginal and conditional cdfs, and the marginal and conditional pmfs or pdfs, which must be
equal, i.e.

$$F_Y(y) = F_{Y|X}(y|x) \tag{3.114}$$

and

$$p_Y(y) = p_{Y|X}(y|x) \quad \text{or} \quad f_Y(y) = f_{Y|X}(y|x), \tag{3.115}$$

depending on whether the variable is discrete or continuous, for any x and any y for which the
conditional distributions are well defined. Equivalently, the joint cdf and the joint pmf or
pdf factor into the marginals.

Definition 3.4.1 (Independent random variables). Two random variables X and Y are independent
if and only if

$$F_{X,Y}(x, y) = F_X(x) \, F_Y(y), \quad \text{for all } (x, y) \in \mathbb{R}^2. \tag{3.116}$$

If the variables are discrete, the following condition is equivalent

$$p_{X,Y}(x, y) = p_X(x) \, p_Y(y), \quad \text{for all } x \in R_X, \; y \in R_Y. \tag{3.117}$$

If the variables are continuous and have joint and marginal pdfs, the following condition is equivalent

$$f_{X,Y}(x, y) = f_X(x) \, f_Y(y), \quad \text{for all } (x, y) \in \mathbb{R}^2. \tag{3.118}$$

We now extend the definition to account for several random variables (or equivalently several 
entries in a random vector) that do not provide information about each other. 

Definition 3.4.2 (Independent random variables). The n entries $X_1, X_2, \ldots, X_n$ in a random
vector $\vec{X}$ are independent if and only if

$$F_{\vec{X}}(\vec{x}) = \prod_{i=1}^{n} F_{X_i}(x_i), \tag{3.119}$$

which is equivalent to

$$p_{\vec{X}}(\vec{x}) = \prod_{i=1}^{n} p_{X_i}(x_i) \tag{3.120}$$

for discrete vectors and

$$f_{\vec{X}}(\vec{x}) = \prod_{i=1}^{n} f_{X_i}(x_i) \tag{3.121}$$

for continuous vectors, if the joint pdf exists.

The following example shows that pairwise independence does not imply independence. 





Example 3.4.3 (Pairwise independence does not imply joint independence). Let $X_1$ and $X_2$
be the outcomes of independent unbiased coin flips. Let $X_3$ be the indicator of the event
{$X_1$ and $X_2$ have the same outcome},

$$ X_3 = \begin{cases} 1 & \text{if } X_1 = X_2, \\ 0 & \text{if } X_1 \neq X_2. \end{cases} \tag{3.122} $$

The pmf of $X_3$ is

$$ p_{X_3}(1) = p_{X_1,X_2}(1,1) + p_{X_1,X_2}(0,0) = \frac{1}{2}, \tag{3.123} $$
$$ p_{X_3}(0) = p_{X_1,X_2}(0,1) + p_{X_1,X_2}(1,0) = \frac{1}{2}. \tag{3.124} $$

$X_1$ and $X_2$ are independent by assumption. $X_1$ and $X_3$ are independent because

$$ p_{X_1,X_3}(0,0) = p_{X_1,X_2}(0,1) = \frac{1}{4} = p_{X_1}(0)\, p_{X_3}(0), \tag{3.125} $$
$$ p_{X_1,X_3}(1,0) = p_{X_1,X_2}(1,0) = \frac{1}{4} = p_{X_1}(1)\, p_{X_3}(0), \tag{3.126} $$
$$ p_{X_1,X_3}(0,1) = p_{X_1,X_2}(0,0) = \frac{1}{4} = p_{X_1}(0)\, p_{X_3}(1), \tag{3.127} $$
$$ p_{X_1,X_3}(1,1) = p_{X_1,X_2}(1,1) = \frac{1}{4} = p_{X_1}(1)\, p_{X_3}(1). \tag{3.128} $$

$X_2$ and $X_3$ are independent too (the reasoning is the same).

However, are $X_1$, $X_2$ and $X_3$ all independent?

$$ p_{X_1,X_2,X_3}(1,1,1) = P(X_1 = 1, X_2 = 1) = \frac{1}{4} \neq p_{X_1}(1)\, p_{X_2}(1)\, p_{X_3}(1) = \frac{1}{8}. \tag{3.129} $$

They are not, which makes sense since $X_3$ is a function of $X_1$ and $X_2$. △
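The checks in Example 3.4.3 can also be carried out by brute-force enumeration. The following is a minimal sketch using exact fractions; the helper names `marginal` and `factorizes` are illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of (X1, X2, X3): X1 and X2 are independent fair coin flips,
# X3 = 1 if X1 == X2 and 0 otherwise, as in (3.122).
joint = {}
for x1, x2 in product((0, 1), repeat=2):
    x3 = 1 if x1 == x2 else 0
    joint[(x1, x2, x3)] = Fraction(1, 4)


def marginal(indices):
    """Marginal pmf of the entries listed in `indices` (0-based positions)."""
    pmf = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in indices)
        pmf[key] = pmf.get(key, 0) + prob
    return pmf


def factorizes(indices):
    """True if the joint pmf of the chosen entries equals the product of their marginals."""
    joint_pmf = marginal(indices)
    singles = [marginal((i,)) for i in indices]
    for values in product((0, 1), repeat=len(indices)):
        expected = Fraction(1)
        for single, v in zip(singles, values):
            expected *= single.get((v,), 0)
        if joint_pmf.get(values, 0) != expected:
            return False
    return True


pairwise = [factorizes(pair) for pair in ((0, 1), (0, 2), (1, 2))]
jointly = factorizes((0, 1, 2))
print("pairwise independent:", pairwise)
print("jointly independent:", jointly)
```

Every pair passes the factorization test, while the triple fails it, in agreement with (3.125) to (3.129).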

Conditional independence indicates that two random variables do not depend on each other,
as long as an additional random variable is known.

Definition 3.4.4 (Conditionally independent random variables). Two random variables X and
Y are conditionally independent given another random variable Z if and only if

$$ F_{X,Y|Z}(x,y|z) = F_{X|Z}(x|z)\, F_{Y|Z}(y|z), \quad \text{for all } (x,y) \in \mathbb{R}^2, \tag{3.130} $$

and any z for which the conditional cdfs are well defined. If the variables are discrete, the
following condition is equivalent

$$ p_{X,Y|Z}(x,y|z) = p_{X|Z}(x|z)\, p_{Y|Z}(y|z), \quad \text{for all } x \in R_X,\ y \in R_Y, \tag{3.131} $$

and any z for which the conditional pmfs are well defined. If the variables are continuous and have
joint and marginal pdfs, the following condition is equivalent

$$ f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z)\, f_{Y|Z}(y|z), \quad \text{for all } (x,y) \in \mathbb{R}^2, \tag{3.132} $$

and any z for which the conditional pdfs are well defined.

The definition can be extended to condition on several random variables.

Definition 3.4.5 (Conditionally independent random variables). The components of a subvector
$\vec{X}_{\mathcal{I}}$, $\mathcal{I} \subseteq \{1, 2, \ldots, n\}$, are conditionally independent given another subvector $\vec{X}_{\mathcal{J}}$,
$\mathcal{J} \subseteq \{1, 2, \ldots, n\}$, if and only if

$$ F_{\vec{X}_{\mathcal{I}} | \vec{X}_{\mathcal{J}}}(\vec{x}_{\mathcal{I}} | \vec{x}_{\mathcal{J}}) = \prod_{i \in \mathcal{I}} F_{X_i | \vec{X}_{\mathcal{J}}}(x_i | \vec{x}_{\mathcal{J}}), \tag{3.133} $$

which is equivalent to

$$ p_{\vec{X}_{\mathcal{I}} | \vec{X}_{\mathcal{J}}}(\vec{x}_{\mathcal{I}} | \vec{x}_{\mathcal{J}}) = \prod_{i \in \mathcal{I}} p_{X_i | \vec{X}_{\mathcal{J}}}(x_i | \vec{x}_{\mathcal{J}}) \tag{3.134} $$

for discrete vectors and

$$ f_{\vec{X}_{\mathcal{I}} | \vec{X}_{\mathcal{J}}}(\vec{x}_{\mathcal{I}} | \vec{x}_{\mathcal{J}}) = \prod_{i \in \mathcal{I}} f_{X_i | \vec{X}_{\mathcal{J}}}(x_i | \vec{x}_{\mathcal{J}}) \tag{3.135} $$

for continuous vectors if the conditional joint pdf exists.

As established in Examples 1.3.5 and 1.3.6, independence does not imply conditional independence
or vice versa.


3.4.2 Variable dependence in probabilistic modeling 

A fundamental consideration when designing a probabilistic model is the dependence between
the different variables, i.e. which variables are independent or conditionally independent from each
other. Although it may sound surprising, if the number of variables is large, introducing some
independence assumptions may be necessary to make the model tractable, even if we know that
all the variables are dependent. To illustrate this, consider a model for the US presidential
election where there are 50 random variables, each representing a state. If the variables only
take two possible values (representing which candidate wins that state), the joint pmf of their
distribution has $2^{50} - 1 > 10^{15}$ degrees of freedom. We wouldn't be able to store the pmf
with all the computer memory in the world! In contrast, if we assume that all the variables are
independent, then the distribution only has 50 free parameters. Of course, this is not necessarily
a good idea because failing to represent dependencies may severely affect the prediction accuracy
of a model, as illustrated in Example 3.5.1 below. Striking a balance between tractability and
accuracy is a crucial challenge in probabilistic modeling.

We now illustrate how the dependence structure of the random variables in a probabilistic
model can be exploited to reduce the number of parameters describing the distribution through
an appropriate factorization of their joint pmf or pdf. Consider three Bernoulli random variables
A, B and C. In general, we need $7 = 2^3 - 1$ parameters to describe the pmf. However, if B and
C are conditionally independent given A we can perform the following factorization

$$ p_{A,B,C} = p_A\, p_{B|A}\, p_{C|A}, \tag{3.136} $$

which only depends on five parameters (one for $p_A$ and two each for $p_{B|A}$ and $p_{C|A}$). It is
important to note that there are many other possible factorizations that do not exploit the
dependence assumptions, such as for example

$$ p_{A,B,C} = p_B\, p_{A|B}\, p_{C|A,B}. \tag{3.137} $$

Figure 3.5: Example of a directed acyclic graph representing a probabilistic model.

For large probabilistic models it is crucial to find factorizations that reduce the number of
parameters as much as possible.
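The parameter counts quoted above can be reproduced with a few lines of arithmetic. This is a small sketch for binary variables only; the function names are illustrative. A factor for a binary variable with k binary parents needs one free parameter per parent configuration, i.e. $2^k$ parameters.

```python
def free_parameters_full_joint(n):
    """Free parameters of an unrestricted joint pmf over n binary variables."""
    return 2 ** n - 1


def free_parameters_factorization(num_parents):
    """Free parameters of a factorization over binary variables.

    `num_parents` lists, for each factor, how many variables it conditions on;
    each factor needs one parameter per configuration of those variables.
    """
    return sum(2 ** k for k in num_parents)


# 50 binary state outcomes: full joint pmf versus full independence.
print(free_parameters_full_joint(50))              # more than 10**15
print(free_parameters_factorization([0] * 50))     # 50

# p_A p_{B|A} p_{C|A}, as in (3.136): 1 + 2 + 2 parameters.
print(free_parameters_factorization([0, 1, 1]))
# p_B p_{A|B} p_{C|A,B}, as in (3.137): 1 + 2 + 4 parameters.
print(free_parameters_factorization([0, 1, 2]))
```

The factorization (3.137) needs as many parameters as the unrestricted joint pmf, which is why it does not exploit the conditional independence of B and C given A.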

3.4.3 Graphical models 

Graphical models are a tool for characterizing the dependence structure of probabilistic models. 
In this section we give a brief description of directed graphical models, which are also called 
Bayesian networks. Undirected graphical models, known as Markov random fields, are out of 
the scope of these notes. We refer the interested reader to more advanced texts in probabilistic 
modeling and machine learning for a more in-depth treatment of graphical models. 

Directed acyclic graphs, known as DAGs, can be interpreted as diagrams representing a factorization
of the joint pmf or pdf of a probabilistic model. In order to specify a valid factorization,
the graphs are constrained to not have any cycles (hence the term acyclic). Each node in the
DAG represents a random variable. The edges between the nodes indicate the dependence
between the variables. The factorization corresponding to a DAG contains:

• The marginal pmf or pdf of the variables corresponding to all nodes with no incoming 
edges. 

• The conditional pmf or pdf of the remaining random variables given their parents. A is a 
parent of B if there is a directed edge from (the node assigned to) A to (the node assigned 
to) B. 

To be concrete, consider the DAG in Figure 3.5. For simplicity we denote each node using the
corresponding random variable and assume that they are all discrete. Nodes $X_1$ and $X_4$ have
no parents, so the factorization of the joint pmf includes their marginal pmfs. Node $X_2$ only
descends from $X_4$ so we include $p_{X_2|X_4}$. Node $X_3$ descends from $X_2$ so we include $p_{X_3|X_2}$.
Finally, node $X_5$ descends from $X_3$ and $X_4$ so we include $p_{X_5|X_3,X_4}$. The factorization is of the
form

$$ p_{X_1,X_2,X_3,X_4,X_5} = p_{X_1}\, p_{X_4}\, p_{X_2|X_4}\, p_{X_3|X_2}\, p_{X_5|X_3,X_4}. \tag{3.138} $$

This factorization reveals some dependence assumptions. By the chain rule another valid
factorization of the joint pmf is

$$ p_{X_1,X_2,X_3,X_4,X_5} = p_{X_1}\, p_{X_4|X_1}\, p_{X_2|X_1,X_4}\, p_{X_3|X_1,X_2,X_4}\, p_{X_5|X_1,X_2,X_3,X_4}. \tag{3.139} $$

Figure 3.6: Directed graphical models corresponding to the variables in Examples 1.3.5 and 1.3.6.
Left: $p_{L,R,T} = p_R\, p_{L|R}\, p_{T|R}$. Right: $p_{M,R,L} = p_M\, p_R\, p_{L|M,R}$.

Comparing both expressions, we see that $X_1$ and all the other variables are independent, since
$p_{X_4|X_1} = p_{X_4}$, $p_{X_2|X_1,X_4} = p_{X_2|X_4}$ and so on. In addition, $X_3$ is conditionally independent of
$X_4$ given $X_2$ since $p_{X_3|X_2,X_4} = p_{X_3|X_2}$. These dependence assumptions can be read directly
from the graph, using the following property.

Theorem 3.4.6 (Local Markov property). The factorization of the joint pmf or pdf represented
by a DAG satisfies the local Markov property: each variable is conditionally independent of its
non-descendants given all its parent variables. In particular, if it has no parents, it is independent
of its non-descendants. To be clear, B is a non-descendant of A if there is no directed path from
A to B.

Proof. Let $X_i$ be an arbitrary variable. We denote by $X_N$ the set of non-descendants of $X_i$, by
$X_P$ the set of parents and by $X_D$ the set of descendants. The factorization represented by the
graphical model is of the form

$$ p_{X_1,\ldots,X_n} = p_{X_N}\, p_{X_P|X_N}\, p_{X_i|X_P}\, p_{X_D|X_i}. \tag{3.140} $$

By the chain rule another valid factorization is

$$ p_{X_1,\ldots,X_n} = p_{X_N}\, p_{X_P|X_N}\, p_{X_i|X_P,X_N}\, p_{X_D|X_i,X_P,X_N}. \tag{3.141} $$

Comparing both expressions we conclude that $p_{X_i|X_P,X_N} = p_{X_i|X_P}$ so $X_i$ is conditionally
independent of $X_N$ given $X_P$. □
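The dependence assumptions read off from the factorization (3.138) can be checked numerically. The sketch below assumes binary variables and fills the conditional pmfs with arbitrary random values (these numbers are hypothetical, chosen only for illustration); it then verifies by enumeration that $X_3$ is conditionally independent of $X_4$ given $X_2$, and that $X_1$ is independent of the remaining variables.

```python
import random
from itertools import product

NODES = (1, 2, 3, 4, 5)
# Parents of each node, as read off from the factorization (3.138).
PARENTS = {1: (), 4: (), 2: (4,), 3: (2,), 5: (3, 4)}

rng = random.Random(0)
# Hypothetical conditional pmfs: CPT[node][parent values] = P(node = 1 | parents).
CPT = {
    node: {vals: rng.uniform(0.1, 0.9) for vals in product((0, 1), repeat=len(pa))}
    for node, pa in PARENTS.items()
}


def joint(x):
    """Joint pmf (3.138) evaluated at x, a dict mapping node -> value in {0, 1}."""
    value = 1.0
    for node, pa in PARENTS.items():
        p_one = CPT[node][tuple(x[p] for p in pa)]
        value *= p_one if x[node] == 1 else 1.0 - p_one
    return value


def prob(event):
    """Probability that the variables in `event` (node -> value) take the given values."""
    total = 0.0
    for values in product((0, 1), repeat=len(NODES)):
        x = dict(zip(NODES, values))
        if all(x[k] == v for k, v in event.items()):
            total += joint(x)
    return total


def gap_x3_given_x2_x4():
    """Largest difference between p(x3 | x2, x4) and p(x3 | x2)."""
    gap = 0.0
    for x2, x3, x4 in product((0, 1), repeat=3):
        with_x4 = prob({2: x2, 3: x3, 4: x4}) / prob({2: x2, 4: x4})
        without_x4 = prob({2: x2, 3: x3}) / prob({2: x2})
        gap = max(gap, abs(with_x4 - without_x4))
    return gap


def gap_x1_independent():
    """Largest difference between the joint pmf and p(x1) p(x2, x3, x4, x5)."""
    gap = 0.0
    for values in product((0, 1), repeat=len(NODES)):
        x = dict(zip(NODES, values))
        rest = {k: v for k, v in x.items() if k != 1}
        gap = max(gap, abs(joint(x) - prob({1: x[1]}) * prob(rest)))
    return gap


print(gap_x3_given_x2_x4(), gap_x1_independent())
```

Both gaps vanish up to floating-point error regardless of the values stored in `CPT`, because the independence statements follow from the structure of the factorization alone.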

We illustrate these ideas by showing the DAGs for Examples 1.3.5 and 1.3.6. 

Example 3.4.7 (Graphical model for Example 1.3.5). We model the different events in Example 1.3.5
using indicator random variables. T represents whether a taxi is available (T = 1)
or not (T = 0), L whether the plane is late (L = 1) or not (L = 0), and R whether it rains
(R = 1) or not (R = 0). In the example, T and L are conditionally independent given R. We
can represent the corresponding factorization using the graph on the left of Figure 3.6.

△

Example 3.4.8 (Graphical model for Example 1.3.6). We model the different events in Example 1.3.6
using indicator random variables. M represents whether a mechanical problem occurs
(M = 1) or not (M = 0) and L and R are the same as in Example 3.4.7. In the example, M
and R are independent, but L depends on both of them. We can represent the corresponding
factorization using the graph on the right of Figure 3.6.

△

Figure 3.7: Fictitious country considered in Example 3.4.9.

The following example introduces an important class of graphical models called Markov
chains, which we will discuss at length in Chapter 7.

Example 3.4.9 (Election). In the country shown in Figure 3.7 the presidential election follows
the same system as in the United States. Citizens cast ballots for electors in the Electoral
College. Each state is entitled to a number of electors (in the US this is usually the same as the
members of Congress). In every state, the electors are pledged to the candidate that wins the
state. Our goal is to model the election probabilistically. We assume that there are only two
candidates A and B. Each state is represented by a random variable $S_i$, $1 \le i \le 4$,

$$ S_i = \begin{cases} 1 & \text{if candidate A wins state } i, \\ -1 & \text{if candidate B wins state } i. \end{cases} \tag{3.142} $$

An important decision is which independence assumptions to make about the model.
Figure 3.8 shows three different options. If we model each state as independent, then we only
need to estimate a single parameter for each state. However, the model may not be accurate,
as the outcome in states with similar demographics is bound to be related. Another option is
to estimate the full joint pmf. The problem is that it may be quite challenging to compute
the parameters. We can estimate the marginal pmfs of the individual states using poll data,
but conditional probabilities are more difficult to estimate. In addition, for larger models it is
not tractable to consider fully dependent models (for instance in the case of the US election,
as mentioned previously). A reasonable compromise could be to model the states that are
not adjacent as conditionally independent given the states between them. For example, we
assume that the outcomes of states 1 and 3 are only related through state 2. The corresponding
graphical model, depicted on the right of Figure 3.8, is called a Markov chain. It corresponds
to a factorization of the form

$$ p_{S_1,S_2,S_3,S_4} = p_{S_1}\, p_{S_2|S_1}\, p_{S_3|S_2}\, p_{S_4|S_3}. \tag{3.143} $$

Under this model we only need to worry about estimating pairwise conditional probabilities, as
opposed to the full joint pmf. We discuss Markov chains at length in Chapter 7.

Figure 3.8: Graphical models capturing different assumptions about the distribution of the random
variables considered in Example 3.4.9. Panels: fully independent, fully dependent, Markov chain.

△


We conclude the section with an example involving continuous variables. 

Example 3.4.10 (Desert). Dani and Felix are traveling through the desert in Arizona. They
become concerned that their car might break down and decide to build a probabilistic model
to evaluate the risk. They model the time until the car breaks down as an exponential random
variable T with a parameter that depends on the state of the motor M and the state of the road
R. These three quantities are represented by random variables in the same probability space.

Unfortunately they have no idea what the state of the motor is so they assume that it is uniform
between 0 (no problem with the motor) and 1 (the motor is almost dead). Similarly, they have
no information about the road, so they also assume that its state is a uniform random variable
between 0 (no problem with the road) and 1 (the road is terrible). In addition, they assume that
the states of the road and the motor are independent and that the parameter of the exponential
random variable that represents the time in hours until there is a breakdown is equal to M + R.
The corresponding graphical model is shown in Figure 3.9.

To find the joint distribution of the random variables, we apply the chain rule to obtain

$$ f_{M,R,T}(m,r,t) = f_M(m)\, f_{R|M}(r|m)\, f_{T|M,R}(t|m,r) \tag{3.144} $$
$$ = f_M(m)\, f_R(r)\, f_{T|M,R}(t|m,r) \quad \text{(by independence of M and R)} \tag{3.145} $$
$$ = \begin{cases} (m+r)\, e^{-(m+r)t} & \text{for } t \ge 0,\ 0 \le m \le 1,\ 0 \le r \le 1, \\ 0 & \text{otherwise.} \end{cases} \tag{3.146} $$

Note that we start with M and R because we know their marginal distribution, whereas we only
know the conditional distribution of T given M and R.

After 15 minutes, the car breaks down. The road seems OK, about a 0.2 on the scale they
defined for the value of R, so they naturally wonder about the state of the motor. Given their
probabilistic model, their uncertainty about the motor given all of this information is captured
by the conditional distribution of M given T and R.

Figure 3.9: The left image is a graphical model representing the random variables in Example 3.4.10.
The right plot shows the conditional pdf of M given T = 0.25 and R = 0.2.

To compute the conditional pdf, we first need to compute the joint marginal distribution of T 
and R by marginalizing over M. In order to simplify the computations, we use the following 
simple lemma. 


Lemma 3.4.11. For any constant c > 0,

$$ \int_{0}^{1} e^{-cx}\, dx = \frac{1 - e^{-c}}{c}, \tag{3.147} $$
$$ \int_{0}^{1} x\, e^{-cx}\, dx = \frac{1 - (1+c)\, e^{-c}}{c^2}. \tag{3.148} $$

Proof. Equation (3.147) is obtained using the antiderivative of the exponential function (itself),
whereas integrating by parts yields (3.148). □


We have

$$ f_{R,T}(r,t) = \int_{m=0}^{1} f_{M,R,T}(m,r,t)\, dm \tag{3.149} $$
$$ = e^{-tr} \left( \int_{m=0}^{1} m\, e^{-tm}\, dm + r \int_{m=0}^{1} e^{-tm}\, dm \right) \tag{3.150} $$
$$ = e^{-tr} \left( \frac{1 - (1+t)\, e^{-t}}{t^2} + \frac{r \left(1 - e^{-t}\right)}{t} \right) \quad \text{by (3.147) and (3.148)} \tag{3.151} $$
$$ = \frac{e^{-tr}}{t^2} \left( 1 + tr - e^{-t}\,(1 + t + tr) \right), \tag{3.152} $$

for $t > 0$, $0 \le r \le 1$.

The conditional pdf of M given T and R is

$$ f_{M|R,T}(m|r,t) = \frac{f_{M,R,T}(m,r,t)}{f_{R,T}(r,t)} \tag{3.153} $$
$$ = \frac{(m+r)\, e^{-(m+r)t}}{\frac{e^{-tr}}{t^2} \left( 1 + tr - e^{-t}\,(1 + t + tr) \right)} \tag{3.154} $$
$$ = \frac{(m+r)\, t^2\, e^{-tm}}{1 + tr - e^{-t}\,(1 + t + tr)}, \tag{3.155} $$

for $t > 0$, $0 \le m \le 1$, $0 \le r \le 1$. Plugging in the observed values, the conditional pdf is equal
to

$$ f_{M|R,T}(m|0.2, 0.25) = \frac{(m + 0.2)\, 0.25^2\, e^{-0.25m}}{1 + 0.25 \cdot 0.2 - e^{-0.25}\,(1 + 0.25 + 0.25 \cdot 0.2)} \tag{3.156} $$
$$ = 1.66\, (m + 0.2)\, e^{-0.25m} \tag{3.157} $$

for $0 \le m \le 1$ and to zero otherwise. The pdf is plotted in Figure 3.9. According to the model,
it seems quite likely that the state of the motor was not good. △
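A quick numerical sanity check of the computations in Example 3.4.10 is sketched below. It compares the closed form (3.152) with a midpoint-rule approximation of the integral over m, recomputes the constant in (3.157), and confirms that the conditional pdf integrates to one. The integration helper and its grid size are illustrative choices, not part of the notes.

```python
import math


def joint_pdf(m, r, t):
    """Joint pdf (3.146) of motor state M, road state R and breakdown time T."""
    if t < 0 or not (0 <= m <= 1) or not (0 <= r <= 1):
        return 0.0
    return (m + r) * math.exp(-(m + r) * t)


def marginal_rt(r, t):
    """Closed-form joint pdf of R and T from (3.152), valid for t > 0."""
    return math.exp(-t * r) / t**2 * (1 + t * r - math.exp(-t) * (1 + t + t * r))


def conditional_m(m, r, t):
    """Conditional pdf of M given R = r and T = t, as in (3.153)."""
    return joint_pdf(m, r, t) / marginal_rt(r, t)


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


r_obs, t_obs = 0.2, 0.25  # observed road state and time (15 minutes = 0.25 hours)

numeric_marginal = midpoint_integral(lambda m: joint_pdf(m, r_obs, t_obs), 0.0, 1.0)
constant = t_obs**2 / (
    1 + t_obs * r_obs - math.exp(-t_obs) * (1 + t_obs + t_obs * r_obs)
)
total_mass = midpoint_integral(lambda m: conditional_m(m, r_obs, t_obs), 0.0, 1.0)

print(numeric_marginal, marginal_rt(r_obs, t_obs))
print(constant)    # the constant multiplying (m + 0.2) exp(-0.25 m) in (3.157)
print(total_mass)  # should be close to 1
```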


3.5 Functions of several random variables 


The pmf of a random variable $Y := g(X_1, \ldots, X_n)$ defined as a function $g : \mathbb{R}^n \to \mathbb{R}$ of several
discrete random variables $X_1, \ldots, X_n$ is given by

$$ p_Y(y) = \sum_{y = g(x_1, \ldots, x_n)} p_{X_1,\ldots,X_n}(x_1, \ldots, x_n). \tag{3.158} $$

This follows directly from (3.11). In words, the probability that $g(X_1, \ldots, X_n) = y$ is the sum
of the joint pmf over all possible values such that $y = g(x_1, \ldots, x_n)$.

Example 3.5.1 (Election). In Example 3.4.9 we discussed several possible models for a presidential
election for a country with four states. Imagine that you are trying to predict the result
of the election using poll data from individual states. The goal is to predict the outcome of the
election, represented by the random variable

$$ O := \begin{cases} 1 & \text{if } \sum_{i=1}^{4} n_i S_i > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3.159} $$

where $n_i$ denotes the number of electors in state i (notice that the sum can never be zero).

From analyzing the poll data you conclude that the probability that candidate A wins each of the
states is 0.15. If you assume that all the states are independent, this is enough to characterize
the joint pmf. Table 3.2 lists the probability of all possible outcomes for this model. By (3.158)
we only need to add up the outcomes for which O = 1. Under the full-independence assumption,
the probability that candidate A wins is 6%.

You are not satisfied by the result because you suspect that the outcomes in different states
are highly dependent. From past elections, you determine that the conditional probability of a
candidate winning a state if they win an adjacent state is indeed very high. You incorporate
your estimate of the conditional probabilities into a Markov-chain model described by (3.143):

$$ p_{S_1}(1) = 0.15, \tag{3.160} $$
$$ p_{S_{i+1}|S_i}(1|1) = 0.435, \quad 1 \le i \le 3, \tag{3.161} $$
$$ p_{S_{i+1}|S_i}(-1|-1) = 0.900, \quad 1 \le i \le 3. \tag{3.162} $$

This means that if candidate B wins a state, they are very likely to win the adjacent one. If
candidate A wins a state, their chance to win an adjacent state is significantly higher than if
they don't (but still lower than candidate B). Under this model the marginal probability that
candidate A wins each state is still 0.15. Table 3.2 lists the probability of all possible outcomes.
The probability that candidate A wins is now 11%, almost double the probability
obtained under the fully-independent model. This illustrates the danger of not accounting for
dependencies between states, which for example may have been one of the reasons why many
forecasts severely underestimated Donald Trump's chances in the 2016 election.

 S_1   S_2   S_3   S_4    O    Prob. (indep.)   Prob. (Markov)
  -1    -1    -1    -1    0        0.5220           0.6203
  -1    -1    -1     1    0        0.0921           0.0687
  -1    -1     1    -1    0        0.0921           0.0431
  -1    -1     1     1    1        0.0163           0.0332
  -1     1    -1    -1    0        0.0921           0.0431
  -1     1    -1     1    0        0.0163           0.0048
  -1     1     1    -1    1        0.0163           0.0208
  -1     1     1     1    1        0.0029           0.0160
   1    -1    -1    -1    0        0.0921           0.0687
   1    -1    -1     1    0        0.0163           0.0077
   1    -1     1    -1    1        0.0163           0.0048
   1    -1     1     1    1        0.0029           0.0037
   1     1    -1    -1    0        0.0163           0.0332
   1     1    -1     1    1        0.0029           0.0037
   1     1     1    -1    1        0.0029           0.0160
   1     1     1     1    1        0.0005           0.0123

Table 3.2: Table of auxiliary values for Example 3.5.1.

△
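The two winning probabilities in Example 3.5.1 can be recomputed by applying (3.158) directly. In the sketch below the set of outcomes with O = 1 is copied from Table 3.2, since the elector counts $n_i$ are not listed in the text. The probability that B keeps a state is not taken as the rounded value 0.900; as an assumption, it is derived from the requirement that the marginal probability of A winning stays at 0.15 in every state.

```python
from itertools import product

P_A = 0.15    # marginal probability that candidate A wins a state
P_AA = 0.435  # P(A wins next state | A wins current state), from (3.161)
# P(B wins next state | B wins current state). Assumption: chosen so that the
# marginal stays exactly 0.15 in every state; it rounds to the 0.900 in (3.162).
P_BB = 1.0 - P_A * (1.0 - P_AA) / (1.0 - P_A)

# Outcomes (s1, s2, s3, s4) for which O = 1, copied from Table 3.2.
A_WINS = {
    (-1, -1, 1, 1), (-1, 1, 1, -1), (-1, 1, 1, 1), (1, -1, 1, -1),
    (1, -1, 1, 1), (1, 1, -1, 1), (1, 1, 1, -1), (1, 1, 1, 1),
}


def prob_independent(states):
    """Joint pmf when all states are independent."""
    p = 1.0
    for s in states:
        p *= P_A if s == 1 else 1.0 - P_A
    return p


def transition(prev, cur):
    """Conditional pmf of the next state given the current one."""
    if prev == 1:
        return P_AA if cur == 1 else 1.0 - P_AA
    return 1.0 - P_BB if cur == 1 else P_BB


def prob_markov(states):
    """Joint pmf under the Markov-chain factorization (3.143)."""
    p = P_A if states[0] == 1 else 1.0 - P_A
    for prev, cur in zip(states, states[1:]):
        p *= transition(prev, cur)
    return p


outcomes = list(product((-1, 1), repeat=4))
win_independent = sum(prob_independent(s) for s in outcomes if s in A_WINS)
win_markov = sum(prob_markov(s) for s in outcomes if s in A_WINS)
print(round(win_independent, 4), round(win_markov, 4))
```

Summing over the eight winning outcomes gives roughly 6% under independence and roughly 11% under the Markov chain, in line with the figures quoted in the example.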

Section 2.5 explains how to derive the distribution of functions of univariate random variables by
first computing their cdf and then differentiating it to obtain their pdf. This directly extends to
functions of several random variables. Let X, Y be random variables defined on the same probability
space, and let U = g(X, Y) and V = h(X, Y) for two arbitrary functions $g, h : \mathbb{R}^2 \to \mathbb{R}$. Then,

$$ F_{U,V}(u,v) = P(U \le u, V \le v) \tag{3.163} $$
$$ = P\left( g(X,Y) \le u,\ h(X,Y) \le v \right) \tag{3.164} $$
$$ = \int_{\{(x,y) \,|\, g(x,y) \le u,\ h(x,y) \le v\}} f_{X,Y}(x,y)\, dx\, dy, \tag{3.165} $$

where the last equality only holds if the joint pdf of X and Y exists. The joint pdf can then be
obtained by differentiation.


Theorem 3.5.2 (Pdf of the sum of two independent random variables). The pdf of Z = X + Y,
where X and Y are independent random variables, is equal to the convolution of their respective
pdfs $f_X$ and $f_Y$,

$$ f_Z(z) = \int_{u=-\infty}^{\infty} f_X(z - u)\, f_Y(u)\, du. \tag{3.166} $$

Proof. First we derive the cdf of Z

$$ F_Z(z) = P(X + Y \le z) \tag{3.167} $$
$$ = \int_{y=-\infty}^{\infty} \int_{x=-\infty}^{z-y} f_X(x)\, f_Y(y)\, dx\, dy \tag{3.168} $$
$$ = \int_{y=-\infty}^{\infty} F_X(z - y)\, f_Y(y)\, dy. \tag{3.169} $$

Note that the joint pdf of X and Y is the product of the marginal pdfs because the random
variables are independent. We now differentiate the cdf to obtain the pdf. Note that this requires
an interchange of a limit operator with a differentiation operator and another interchange of
an integral operator with a differentiation operator, which are justified because the functions
involved are bounded and integrable.

$$ f_Z(z) = \frac{d}{dz} \lim_{u \to \infty} \int_{y=-u}^{u} F_X(z - y)\, f_Y(y)\, dy \tag{3.170} $$
$$ = \lim_{u \to \infty} \frac{d}{dz} \int_{y=-u}^{u} F_X(z - y)\, f_Y(y)\, dy \tag{3.171} $$
$$ = \lim_{u \to \infty} \int_{y=-u}^{u} \frac{d}{dz} F_X(z - y)\, f_Y(y)\, dy \tag{3.172} $$
$$ = \lim_{u \to \infty} \int_{y=-u}^{u} f_X(z - y)\, f_Y(y)\, dy. \tag{3.173} $$

□

Example 3.5.3 (Coffee beans). A company that makes coffee buys beans from two small local
producers in Colombia and Vietnam. The amount of beans they can buy from each producer
varies depending on the weather. The company models these quantities C and V as independent
random variables (assuming that the weather in Colombia is independent from the weather in
Vietnam) which have uniform distributions in [0, 1] and [0, 2] (the unit is tons) respectively.

Figure 3.10: Probability density functions in Example 3.5.3.

We now compute the pdf of the total amount of coffee beans B := C + V applying Theorem 3.5.2,

$$ f_B(b) = \int_{u=-\infty}^{\infty} f_C(b - u)\, f_V(u)\, du \tag{3.174} $$
$$ = \frac{1}{2} \int_{u=0}^{2} f_C(b - u)\, du \tag{3.175} $$
$$ = \begin{cases} \frac{1}{2} \int_{u=0}^{b} du = \frac{b}{2} & \text{if } b \le 1, \\ \frac{1}{2} \int_{u=b-1}^{b} du = \frac{1}{2} & \text{if } 1 \le b \le 2, \\ \frac{1}{2} \int_{u=b-1}^{2} du = \frac{3-b}{2} & \text{if } 2 \le b \le 3. \end{cases} \tag{3.176} $$

The pdf of B is shown in Figure 3.10.

△
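The piecewise expression (3.176) can be checked against a direct numerical evaluation of the convolution in Theorem 3.5.2. This is a rough sketch: the midpoint rule, the integration range and the grid size are illustrative choices, and the closed form is extended with the value zero outside [0, 3].

```python
def f_C(x):
    """Pdf of C, uniform on [0, 1]."""
    return 1.0 if 0 <= x <= 1 else 0.0


def f_V(x):
    """Pdf of V, uniform on [0, 2]."""
    return 0.5 if 0 <= x <= 2 else 0.0


def f_B_closed_form(b):
    """Piecewise pdf from (3.176), set to zero outside [0, 3]."""
    if 0 <= b <= 1:
        return b / 2
    if 1 < b <= 2:
        return 0.5
    if 2 < b <= 3:
        return (3 - b) / 2
    return 0.0


def f_B_numeric(b, n=20_000, lo=-1.0, hi=4.0):
    """Midpoint-rule approximation of the convolution integral (3.174)."""
    h = (hi - lo) / n
    grid = (lo + (k + 0.5) * h for k in range(n))
    return h * sum(f_C(b - u) * f_V(u) for u in grid)


for b in (0.5, 1.5, 2.5):
    print(b, f_B_closed_form(b), round(f_B_numeric(b), 4))
```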


3.6 Generating multivariate random variables 

In Section 2.6 we consider the problem of generating independent samples from an arbitrary 
univariate distribution. Assuming that a procedure to achieve this is available, we can use it to 
sample from an arbitrary multivariate distribution by generating samples from the appropriate 
conditional distributions. 

Algorithm 3.6.1 (Sampling from a multivariate distribution). Let $X_1, X_2, \ldots, X_n$ be random
variables belonging to the same probability space. To generate samples from their joint
distribution we sequentially sample from their conditional distributions:

1. Obtain a sample $x_1$ of $X_1$.

2. For $i = 2, 3, \ldots, n$, obtain a sample $x_i$ of $X_i$ given the event $\{X_1 = x_1, \ldots, X_{i-1} = x_{i-1}\}$
by sampling from $F_{X_i | X_1, \ldots, X_{i-1}}(\cdot \,|\, x_1, \ldots, x_{i-1})$.

The chain rule implies that the output $x_1, \ldots, x_n$ of this procedure are samples from the joint
distribution of the random variables. The following example considers the problem of sampling
from a mixture of exponential random variables.

Example 3.6.2 (Mixture of exponentials). Let B be a Bernoulli random variable with parameter
p and X an exponential random variable with parameter 1 if B = 0 and 2 if B = 1. Assume
that we have access to two independent samples $u_1$ and $u_2$ from a uniform distribution in [0, 1].
To obtain samples from B and X:

1. We set b := 1 if $u_1 \le p$ and b := 0 otherwise. This ensures that b is a Bernoulli sample
with the right parameter.

2. Then, we set

$$ x := \frac{1}{\lambda} \log\left( \frac{1}{1 - u_2} \right) \tag{3.177} $$

where $\lambda := 1$ if b = 0 and $\lambda := 2$ if b = 1. By Example 2.6.4 x is distributed as an
exponential with parameter $\lambda$.

△
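The two steps of Example 3.6.2 translate directly into code. The sketch below takes the uniform samples as arguments so that the mapping is deterministic; the function name `sample_mixture` and the choice p = 0.3 in the usage lines are illustrative.

```python
import math
import random


def sample_mixture(p, u1, u2):
    """Map two uniform samples u1, u2 in [0, 1) to a sample (b, x) of (B, X).

    Step 1 sets the Bernoulli sample b; step 2 applies (3.177) with the
    rate lambda selected by b.
    """
    b = 1 if u1 <= p else 0
    lam = 2.0 if b == 1 else 1.0
    x = math.log(1.0 / (1.0 - u2)) / lam
    return b, x


# Usage: draw a few samples with a seeded generator (random() lies in [0, 1)).
rng = random.Random(0)
draws = [sample_mixture(0.3, rng.random(), rng.random()) for _ in range(5)]
for b, x in draws:
    print(b, round(x, 4))
```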


3.7 Rejection sampling 

We end the chapter by describing rejection sampling, also known as the accept-reject method, an 
alternative procedure for sampling from univariate distributions. The reason we have deferred it 
to this chapter is that analyzing this technique requires an understanding of multivariate random 
variables. Before presenting the method, we motivate it using discrete random variables. 


3.7.1 Rejection sampling for discrete random variables 

Our goal is to simulate a random variable Y using samples from another random variable
X. To simplify the exposition, we assume that their pmfs $p_X$ and $p_Y$ have nonzero values
in the set {1, 2, ..., n} (generalizing to other discrete sets is straightforward). The idea behind
rejection sampling is that we can choose a subset of the samples of X in a way that reshapes
its distribution. When we obtain a sample of X we decide whether to accept it or reject it with
a certain probability. The probability depends on the value of the sample x: if $p_X(x)$ is much
larger than $p_Y(x)$ we should probably reject it most of the time (but not always!). For each
$x \in \{1, 2, \ldots, n\}$ we denote the probability of accepting the sample by $a_x$.

We are interested in the distribution of only the accepted samples. Mathematically, the pmf of
the accepted samples is equal to the conditional pmf of X, conditioned on the event that the
sample is accepted,

$$ p_{X|\text{Accepted}}(x|\text{Accepted}) = \frac{p_X(x)\, P(\text{Accepted} \,|\, X = x)}{\sum_{i=1}^{n} p_X(i)\, P(\text{Accepted} \,|\, X = i)} \quad \text{by Bayes' rule} \tag{3.178} $$
$$ = \frac{p_X(x)\, a_x}{\sum_{i=1}^{n} p_X(i)\, a_i}. \tag{3.179} $$

We would like to fix the accept probabilities so that for all $x \in \{1, 2, \ldots, n\}$

$$ p_{X|\text{Accepted}}(x|\text{Accepted}) = p_Y(x). \tag{3.180} $$

This can be achieved by fixing

$$ a_x := \frac{p_Y(x)}{c\, p_X(x)}, \quad x \in \{1, \ldots, n\}, \tag{3.181} $$

for any constant c. However, this will not yield a valid probability for an arbitrary c, because
$a_x$ could be larger than one! To avoid this issue, we need

$$ c \ge \max_{x \in \{1, \ldots, n\}} \frac{p_Y(x)}{p_X(x)}. \tag{3.182} $$

Finally, we can use a uniform random variable U between 0 and 1 to accept or reject, accepting
each sample x if $U \le a_x$. You might be wondering why we can't just generate Y directly from
U. That would indeed work and is much simpler; here we are just presenting the discrete
case as a pedagogical introduction to the continuous case.


Algorithm 3.7.1 (Rejection sampling). Let X and Y be random variables with pmfs $p_X$ and
$p_Y$ such that

$$ c \ge \max_{x \in \{1, \ldots, n\}} \frac{p_Y(x)}{p_X(x)} \tag{3.183} $$

for all x such that $p_Y(x)$ is nonzero, and U a random variable that is uniformly distributed in
[0, 1] and independent of X.

1. Obtain a sample y of X.

2. Obtain a sample u of U.

3. Declare y to be a sample of Y if

$$ u \le \frac{p_Y(y)}{c\, p_X(y)}. \tag{3.184} $$
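The discrete algorithm can be sketched as follows. The first function evaluates (3.179) exactly to confirm that the accepted samples have pmf $p_Y$; the second repeats steps 1 to 3 until a proposal is accepted. The pmfs used here (a uniform proposal on {1, 2, 3} and a target (0.5, 0.3, 0.2)) are hypothetical values chosen for illustration.

```python
import random


def accepted_pmf(p_x, p_y, c):
    """Pmf of the accepted samples from (3.179) with a_x = p_Y(x) / (c p_X(x))."""
    a = [py / (c * px) for px, py in zip(p_x, p_y)]
    if max(a) > 1.0 + 1e-12:
        raise ValueError("c is too small: some accept probability exceeds one")
    norm = sum(px * ax for px, ax in zip(p_x, a))
    return [px * ax / norm for px, ax in zip(p_x, a)]


def rejection_sample(p_x, p_y, c, rng):
    """Repeat steps 1-3 of Algorithm 3.7.1 until a sample is accepted."""
    values = range(1, len(p_x) + 1)
    while True:
        y = rng.choices(values, weights=p_x)[0]  # step 1: sample from p_X
        u = rng.random()                         # step 2: uniform sample
        if u <= p_y[y - 1] / (c * p_x[y - 1]):   # step 3: accept test (3.184)
            return y


p_x = [1 / 3, 1 / 3, 1 / 3]  # hypothetical proposal pmf
p_y = [0.5, 0.3, 0.2]        # hypothetical target pmf
c = max(py / px for px, py in zip(p_x, p_y))  # smallest valid constant, here 1.5

rng = random.Random(0)
samples = [rejection_sample(p_x, p_y, c, rng) for _ in range(20_000)]
freqs = [samples.count(v) / len(samples) for v in (1, 2, 3)]
print(accepted_pmf(p_x, p_y, c))
print(freqs)
```

Choosing c as small as (3.182) allows maximizes the fraction of proposals that are accepted, since the normalizing sum in (3.179) equals 1/c.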


3.7.2 Rejection sampling for continuous random variables 

Here we show that the idea presented in the previous section can be applied in the continuous
case. The goal is to obtain samples according to a target pdf $f_Y$ by choosing samples obtained
according to a different pdf $f_X$. As in the discrete case, we need

$$ f_Y(y) \le c\, f_X(y) \tag{3.185} $$

for all y, where c is a fixed positive constant. In words, the pdf of Y must be bounded by a
scaled version of the pdf of X.

Algorithm 3.7.2 (Rejection sampling). Let X be a random variable with pdf $f_X$ and U a random
variable that is uniformly distributed in [0, 1] and independent of X. We assume that (3.185)
holds.

1. Obtain a sample y of X.

2. Obtain a sample u of U.

3. Declare y to be a sample of Y if

$$ u \le \frac{f_Y(y)}{c\, f_X(y)}. \tag{3.186} $$

The following theorem establishes that the samples obtained by rejection sampling have the
desired distribution.

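Theorem 3.7.3 (Rejection sampling works). If assumption (3.185) holds, then the samples
produced by rejection sampling are distributed according to $f_Y$.

Proof. Let Z denote the random variable produced by rejection sampling. The cdf of Z is equal
to

$$ F_Z(y) = P\left( X \le y \,\Big|\, U \le \frac{f_Y(X)}{c\, f_X(X)} \right) \tag{3.187} $$
$$ = \frac{P\left( X \le y,\ U \le \frac{f_Y(X)}{c\, f_X(X)} \right)}{P\left( U \le \frac{f_Y(X)}{c\, f_X(X)} \right)}. \tag{3.188} $$

To compute the numerator we integrate the joint pdf of U and X over the region of interest

$$ P\left( X \le y,\ U \le \frac{f_Y(X)}{c\, f_X(X)} \right) = \int_{x=-\infty}^{y} \int_{u=0}^{\frac{f_Y(x)}{c f_X(x)}} f_X(x)\, du\, dx \tag{3.189} $$
$$ = \int_{x=-\infty}^{y} \frac{f_Y(x)}{c\, f_X(x)}\, f_X(x)\, dx \tag{3.190} $$
$$ = \frac{1}{c} \int_{x=-\infty}^{y} f_Y(x)\, dx \tag{3.191} $$
$$ = \frac{1}{c}\, F_Y(y). \tag{3.192} $$

The denominator is obtained in a similar way

$$ P\left( U \le \frac{f_Y(X)}{c\, f_X(X)} \right) = \int_{x=-\infty}^{\infty} \int_{u=0}^{\frac{f_Y(x)}{c f_X(x)}} f_X(x)\, du\, dx \tag{3.193} $$
$$ = \int_{x=-\infty}^{\infty} \frac{f_Y(x)}{c\, f_X(x)}\, f_X(x)\, dx \tag{3.194} $$
$$ = \frac{1}{c} \int_{x=-\infty}^{\infty} f_Y(x)\, dx \tag{3.195} $$
$$ = \frac{1}{c}. \tag{3.196} $$

We conclude that

$$ F_Z(y) = F_Y(y), \tag{3.197} $$

so the method produces samples from the distribution of Y.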

□

We now illustrate the method by applying it to produce a Gaussian random variable from an
exponential and a uniform random variable.

Example 3.7.4 (Generating a Gaussian random variable). In Example 2.6.4 we learned how
to generate an exponential random variable using samples from a uniform distribution. In this
example we will use samples from an exponential distribution to generate a standard Gaussian
random variable applying rejection sampling.

The following lemma shows that we can generate a standard Gaussian random variable Y by:

1. Generating a random variable H with pdf

$$ f_H(h) := \begin{cases} \sqrt{\frac{2}{\pi}} \exp\left( -\frac{h^2}{2} \right) & \text{if } h \ge 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.198} $$

2. Generating a random variable S which is equal to 1 or −1 with probability 1/2, for example
by applying the method described in Section 2.6.1.

3. Setting Y := SH.

Lemma 3.7.5. Let H be a continuous random variable with pdf given by (3.198) and S a discrete
random variable which equals 1 with probability 1/2 and −1 with probability 1/2. The random
variable Y := SH is a standard Gaussian.

Proof. The conditional pdf of Y given S is given by

$$ f_{Y|S}(y|1) = \begin{cases} f_H(y) & \text{if } y \ge 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3.199} $$
$$ f_{Y|S}(y|-1) = \begin{cases} f_H(-y) & \text{if } y < 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.200} $$

By Lemma 3.3.5 we have

$$ f_Y(y) = p_S(1)\, f_{Y|S}(y|1) + p_S(-1)\, f_{Y|S}(y|-1) \tag{3.201} $$
$$ = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right). \tag{3.202} $$

□

The reason why we reduce the problem to generating H is that its pdf is only nonzero on the
positive axis, which allows us to bound it with the pdf of an exponential random
variable X with parameter 1. If we set $c := \sqrt{2e/\pi}$ then $f_H(x) \le c\, f_X(x)$ for all x, as illustrated
in Figure 3.11. Indeed,

$$ \frac{f_H(x)}{f_X(x)} = \frac{\sqrt{\frac{2}{\pi}} \exp\left( -\frac{x^2}{2} \right)}{\exp(-x)} \tag{3.203} $$
$$ = \sqrt{\frac{2e}{\pi}} \exp\left( -\frac{(x-1)^2}{2} \right) \tag{3.204} $$
$$ \le \sqrt{\frac{2e}{\pi}}. \tag{3.205} $$

Figure 3.11: Bound on the pdf of the target distribution in Example 3.7.4.

We can now apply rejection sampling to generate H. The steps are

1. Obtain a sample x from an exponential random variable X with parameter one.

2. Obtain a sample u from U, which is uniformly distributed in [0, 1].

3. Accept x as a sample of H if

$$ u \le \exp\left( -\frac{(x-1)^2}{2} \right). \tag{3.206} $$

This procedure is illustrated in Figure 3.12. The rejection mechanism ensures that the accepted
samples have the right distribution. △

Figure 3.12: Illustration of how to generate 50,000 samples from the random variable H defined in
Example 3.7.4 via rejection sampling. Panels: scatterplot of samples from X and samples from U
(accepted samples are colored red, $\exp\left(-(x-1)^2/2\right)$ is shown in black); histogram of accepted
samples ($f_H$ is shown in black).
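The procedure of Example 3.7.4 is sketched below. The exponential proposal is generated by inverse-transform sampling as in (3.177) with parameter one, the accept test is (3.206), and the random sign implements Lemma 3.7.5. The seed and the sample size of 20,000 are arbitrary choices for illustration.

```python
import math
import random


def sample_half_gaussian(rng):
    """Return a sample of H and the number of proposals that were needed."""
    trials = 0
    while True:
        trials += 1
        x = math.log(1.0 / (1.0 - rng.random()))  # exponential sample, parameter 1
        u = rng.random()                          # uniform sample in [0, 1)
        if u <= math.exp(-((x - 1.0) ** 2) / 2.0):  # accept test (3.206)
            return x, trials


def sample_standard_gaussian(rng):
    """Return Y = S * H, with S equal to +1 or -1 with probability 1/2 each."""
    h, trials = sample_half_gaussian(rng)
    s = 1 if rng.random() < 0.5 else -1
    return s * h, trials


rng = random.Random(0)
results = [sample_standard_gaussian(rng) for _ in range(20_000)]
samples = [y for y, _ in results]
total_trials = sum(t for _, t in results)

mean = sum(samples) / len(samples)
variance = sum((y - mean) ** 2 for y in samples) / (len(samples) - 1)
acceptance_rate = len(samples) / total_trials

print(round(mean, 3), round(variance, 3))
# By (3.196) the acceptance probability is 1/c = sqrt(pi / (2e)), roughly 0.76.
print(round(acceptance_rate, 3), round(math.sqrt(math.pi / (2 * math.e)), 3))
```

The sample mean and variance should be close to 0 and 1, and roughly three out of four proposals are accepted.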

















Chapter 4 

Expectation 


In this chapter we introduce some quantities that describe the behavior of random variables
very succinctly. The mean is the value around which the distribution of a random variable is 
centered. The variance quantifies the extent to which a random variable fluctuates around the 
mean. The covariance of two random variables indicates whether they tend to deviate from 
their means in a similar way. In multiple dimensions, the covariance matrix of a random vector 
encodes its variance in every possible direction. These quantities do not completely characterize 
the distribution of a random variable or vector, but they provide a useful summary of their 
behavior with just a few numbers. 

4.1 Expectation operator 

The expectation operator allows us to define the mean, variance and covariance rigorously. It 
maps a function of a random variable or of several random variables to an average weighted by 
the corresponding pmf or pdf. 

Definition 4.1.1 (Expectation for discrete random variables). Let X be a discrete random
variable with range R. The expected value of a function $g(X)$, $g : \mathbb{R} \to \mathbb{R}$, of X is

\[
\mathrm{E}(g(X)) := \sum_{x \in R} g(x)\, p_X(x). \tag{4.1}
\]

Similarly, if X, Y are both discrete random variables with ranges $R_X$ and $R_Y$ then the expected
value of a function $g(X,Y)$, $g : \mathbb{R}^2 \to \mathbb{R}$, of X and Y is

\[
\mathrm{E}(g(X,Y)) := \sum_{x \in R_X} \sum_{y \in R_Y} g(x,y)\, p_{X,Y}(x,y). \tag{4.2}
\]

If $\vec{X}$ is an n-dimensional discrete random vector, the expected value of a function $g(\vec{X})$,
$g : \mathbb{R}^n \to \mathbb{R}$, of $\vec{X}$ is

\[
\mathrm{E}(g(\vec{X})) := \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} g(\vec{x})\, p_{\vec{X}}(\vec{x}). \tag{4.3}
\]

Definition 4.1.2 (Expectation for continuous random variables). Let X be a continuous random
variable. The expected value of a function $g(X)$, $g : \mathbb{R} \to \mathbb{R}$, of X is

\[
\mathrm{E}(g(X)) := \int_{x=-\infty}^{\infty} g(x)\, f_X(x)\, dx. \tag{4.4}
\]

Similarly, if X, Y are both continuous random variables then the expected value of a function
$g(X,Y)$, $g : \mathbb{R}^2 \to \mathbb{R}$, of X and Y is

\[
\mathrm{E}(g(X,Y)) := \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} g(x,y)\, f_{X,Y}(x,y)\, dx\, dy. \tag{4.5}
\]

If $\vec{X}$ is an n-dimensional random vector, the expected value of a function $g(\vec{X})$, $g : \mathbb{R}^n \to \mathbb{R}$, of
$\vec{X}$ is

\[
\mathrm{E}(g(\vec{X})) := \int_{x_1=-\infty}^{\infty} \int_{x_2=-\infty}^{\infty} \cdots \int_{x_n=-\infty}^{\infty} g(\vec{x})\, f_{\vec{X}}(\vec{x})\, dx_1\, dx_2 \ldots dx_n. \tag{4.6}
\]


In the case of quantities that depend on both continuous and discrete random variables, the 
product of the marginal and conditional distributions plays the role of the joint pdf or pmf. 


Definition 4.1.3 (Expectation with respect to continuous and discrete random variables). If
C is a continuous random variable and D a discrete random variable with range $R_D$ defined on
the same probability space, the expected value of a function $g(C,D)$ of C and D is

\[
\mathrm{E}(g(C,D)) := \int_{c=-\infty}^{\infty} \sum_{d \in R_D} g(c,d)\, f_C(c)\, p_{D|C}(d|c)\, dc \tag{4.7}
\]
\[
= \sum_{d \in R_D} \int_{c=-\infty}^{\infty} g(c,d)\, p_D(d)\, f_{C|D}(c|d)\, dc. \tag{4.8}
\]


The expected value of a certain quantity may be infinite or not even exist if the corresponding
sum or integral tends towards infinity or has an undefined value. This is illustrated by
Examples 4.1.4 and 4.2.2 below.

Example 4.1.4 (St Petersburg paradox). A casino offers you the following game. You will flip
an unbiased coin until it lands on heads and the casino will pay you $2^k$ dollars where k is the
number of flips. How much are you willing to pay in order to play?

Let us compute the expected gain. If the flips are independent, the total number of flips X is a
geometric random variable, so $p_X(k) = 1/2^k$. The gain is $2^X$ which means that

\[
\mathrm{E}(\text{Gain}) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \infty. \tag{4.9}
\]

The expected gain is infinite, but since you only get to play once, the amount of money that
you are willing to pay is probably bounded. This is known as the St Petersburg paradox.

△
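
To see how the sum in Example 4.1.4 diverges, one can evaluate its partial sums exactly. The sketch below assumes a hypothetical variant of the game in which only the outcomes up to `max_flips` flips are paid out; the cap is an illustrative device that does not appear in the example itself.

```python
from fractions import Fraction

def capped_expected_gain(max_flips):
    """Partial sum of 2^k * (1 / 2^k) over k = 1, ..., max_flips (exact arithmetic)."""
    return sum(Fraction(2 ** k) * Fraction(1, 2 ** k) for k in range(1, max_flips + 1))

for cap in (10, 20, 40):
    print(cap, capped_expected_gain(cap))
```

Every term of the sum equals one, so the partial sum equals the cap and grows without bound as the cap is removed.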


A fundamental property of the expectation operator is that it is linear.

Theorem 4.1.5 (Linearity of expectation). For any constant $a \in \mathbb{R}$, any function $g : \mathbb{R} \to \mathbb{R}$
and any continuous or discrete random variable X

\[
\mathrm{E}(a\, g(X)) = a\, \mathrm{E}(g(X)). \tag{4.10}
\]

For any constants $a, b \in \mathbb{R}$, any functions $g_1, g_2 : \mathbb{R}^2 \to \mathbb{R}$ and any continuous or discrete
random variables X and Y

\[
\mathrm{E}(a\, g_1(X,Y) + b\, g_2(X,Y)) = a\, \mathrm{E}(g_1(X,Y)) + b\, \mathrm{E}(g_2(X,Y)). \tag{4.11}
\]

Proof. The theorem follows immediately from the linearity of sums and integrals. □

Linearity of expectation makes it very easy to compute the expectation of linear functions of 
random variables. In contrast, computing the joint pdf or pmf is usually much more complicated. 

Example 4.1.6 (Coffee beans (continued from Example 3.5.3)). Let us compute the expected
total amount of beans that can be bought. C is uniform in [0,1], so E(C) = 1/2. V is uniform
in [0,2], so E(V) = 1. By linearity of expectation

\[
\mathrm{E}(C + V) = \mathrm{E}(C) + \mathrm{E}(V) \tag{4.12}
\]
\[
= 1.5 \text{ tons}. \tag{4.13}
\]

Note that this holds even if the two quantities are not independent.

△

If two random variables are independent, then the expectation of the product factors into a
product of expectations.

Theorem 4.1.7 (Expectation of functions of independent random variables). If X, Y are independent
random variables defined on the same probability space, and $g, h : \mathbb{R} \to \mathbb{R}$ are univariate
real-valued functions, then

\[
\mathrm{E}(g(X)\, h(Y)) = \mathrm{E}(g(X))\, \mathrm{E}(h(Y)). \tag{4.14}
\]


Proof. We prove the result for continuous random variables, but the proof for discrete random
variables is essentially the same.

\[
\mathrm{E}(g(X)\, h(Y)) = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} g(x)\, h(y)\, f_{X,Y}(x,y)\, dx\, dy \tag{4.15}
\]
\[
= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} g(x)\, h(y)\, f_X(x)\, f_Y(y)\, dx\, dy \quad \text{by independence} \tag{4.16}
\]
\[
= \mathrm{E}(g(X))\, \mathrm{E}(h(Y)). \tag{4.17}
\]

□
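
Theorems 4.1.5 and 4.1.7 can be checked exactly on a small discrete model. The sketch below is only an illustration: the two independent fair dice, the functions g and h, and the helper `expect` are all assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)
p = Fraction(1, 36)   # joint pmf of two independent fair dice (illustrative choice)

def expect(f):
    """E(f(X, Y)) as a weighted sum over the joint pmf, as in equation (4.2)."""
    return sum(f(x, y) * p for x, y in product(die, die))

g = lambda x: x * x        # arbitrary univariate functions
h = lambda y: 3 * y + 1

# Theorem 4.1.7: the expectation of the product factors under independence.
lhs = expect(lambda x, y: g(x) * h(y))
rhs = expect(lambda x, y: g(x)) * expect(lambda x, y: h(y))

# Linearity holds even for dependent quantities: X and X + Y are dependent.
lin_lhs = expect(lambda x, y: 2 * x + 5 * (x + y))
lin_rhs = 2 * expect(lambda x, y: x) + 5 * expect(lambda x, y: x + y)

# The product rule generally fails without independence.
dep_lhs = expect(lambda x, y: x * (x + y))
dep_rhs = expect(lambda x, y: x) * expect(lambda x, y: x + y)

print(lhs == rhs, lin_lhs == lin_rhs, dep_lhs == dep_rhs)   # True True False
```

Exact rational arithmetic is used so that the equalities can be tested without rounding error.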
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Figure 4.1: Probability density function of a Cauchy random variable. 


4.2 Mean and variance 


4.2.1 Mean 


The mean of a random variable is equal to its expected value. 

Definition 4.2.1 (Mean). The mean or first moment of X is the expected value of X: E(X). 


Table 4.1 lists the means of some important random variables. The derivations can be found in 
Section 4.5.1. As illustrated by Figure 4.3, the mean is the center of mass of the pmf or the pdf 
of the corresponding random variable. 

If the distribution of a random variable is very heavy tailed, which means that the probability of
the random variable taking large values decays slowly, its mean may be infinite. This is the case 
of the random variable representing the gain in Example 4.1.4. The following example shows 
that the mean may not exist if the value of the corresponding sum or integral is not well defined. 

Example 4.2.2 (Cauchy random variable). The pdf of the Cauchy random variable, which is
shown in Figure 4.1, is given by

\[
f_X(x) = \frac{1}{\pi (1 + x^2)}. \tag{4.18}
\]

By the definition of expected value,

\[
\mathrm{E}(X) = \int_{-\infty}^{\infty} \frac{x}{\pi (1 + x^2)}\, dx = \int_{0}^{\infty} \frac{x}{\pi (1 + x^2)}\, dx - \int_{0}^{\infty} \frac{x}{\pi (1 + x^2)}\, dx. \tag{4.19}
\]

Now, by the change of variables $t = x^2$,

\[
\int_{0}^{\infty} \frac{x}{\pi (1 + x^2)}\, dx = \int_{0}^{\infty} \frac{1}{2\pi (1 + t)}\, dt = \lim_{t \to \infty} \frac{\log(1 + t)}{2\pi} = \infty, \tag{4.20}
\]

so E(X) does not exist, as it is the difference of two limits that tend to infinity.

△
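
The divergence in (4.20) can also be observed numerically. The following sketch truncates the integral at a finite upper limit T (a hypothetical parameter introduced only for illustration) and compares a midpoint-rule approximation with the closed form $\log(1 + T^2)/(2\pi)$ given by the change of variables.

```python
import math

def truncated_integral(T, steps=20_000):
    """Midpoint-rule approximation of the integral of x / (pi (1 + x^2)) over [0, T]."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x / (math.pi * (1.0 + x * x))
    return total * h

def closed_form(T):
    """log(1 + T^2) / (2 pi), obtained with the change of variables t = x^2."""
    return math.log(1.0 + T * T) / (2.0 * math.pi)

for T in (10.0, 1e3, 1e6):
    print(T, closed_form(T))
```

The truncated value grows logarithmically in T, so neither half of the integral in (4.19) converges.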
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The mean of a random vector is defined as the vector formed by the means of its components. 
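
Definition 4.2.3 (Mean of a random vector). The mean of a random vector $\vec{X}$ is

\[
\mathrm{E}(\vec{X}) := \begin{bmatrix} \mathrm{E}(X_1) \\ \mathrm{E}(X_2) \\ \vdots \\ \mathrm{E}(X_n) \end{bmatrix}. \tag{4.21}
\]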


As in the univariate case, the mean can be interpreted as the value around which the distribution 
of the random vector is centered. 

It follows immediately from the linearity of the expectation operator in one dimension that the 
mean operator is linear. 

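Theorem 4.2.4 (Mean of linear transformation of a random vector). For any random vector
$\vec{X}$ of dimension n, any matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$

\[
\mathrm{E}(A\vec{X} + \vec{b}) = A\, \mathrm{E}(\vec{X}) + \vec{b}. \tag{4.22}
\]

Proof.

\[
\mathrm{E}(A\vec{X} + \vec{b}) = \begin{bmatrix} \mathrm{E}\left(\sum_{i=1}^{n} A_{1i} X_i + b_1\right) \\ \mathrm{E}\left(\sum_{i=1}^{n} A_{2i} X_i + b_2\right) \\ \vdots \\ \mathrm{E}\left(\sum_{i=1}^{n} A_{mi} X_i + b_m\right) \end{bmatrix} \tag{4.23}
\]
\[
= \begin{bmatrix} \sum_{i=1}^{n} A_{1i}\, \mathrm{E}(X_i) + b_1 \\ \sum_{i=1}^{n} A_{2i}\, \mathrm{E}(X_i) + b_2 \\ \vdots \\ \sum_{i=1}^{n} A_{mi}\, \mathrm{E}(X_i) + b_m \end{bmatrix} \quad \text{by linearity of expectation} \tag{4.24}
\]
\[
= A\, \mathrm{E}(\vec{X}) + \vec{b}. \tag{4.25}
\]

□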


4.2.2 Median 

The mean is often interpreted as representing a typical value taken by the random variable. 
However, the probability of a random variable being equal to its mean may be zero! For instance, 
a Bernoulli random variable cannot equal 0.5. In addition, the mean can be severely distorted 
by a small subset of extreme values, as illustrated by Example 4.2.6 below. The median is an 
alternative characterization of a typical value taken by the random variable, which is designed 
to be more robust to such situations. It is defined as the midpoint of the pmf or pdf of the 
random variable. If the random variable is continuous, the probability that it is either larger or 
smaller than the median is equal to 1/2.

[Figure 4.2: plot of the pdf as a function of x, with the mean and the median marked.]

Figure 4.2: Uniform pdf in [-4.5, 4.5] ∪ [99.5, 100.5]. The mean is 10 and the median is 0.5.

Definition 4.2.5 (Median). The median of a discrete random variable X is a number m such
that

\[
\mathrm{P}(X \leq m) \geq \frac{1}{2} \quad \text{and} \quad \mathrm{P}(X \geq m) \geq \frac{1}{2}. \tag{4.26}
\]

The median of a continuous random variable X is a number m such that

\[
F_X(m) = \int_{-\infty}^{m} f_X(x)\, dx = \frac{1}{2}. \tag{4.27}
\]


The following example illustrates the robustness of the median to the presence of a small subset 
of extreme values with nonzero probability. 

Example 4.2.6 (Mean vs median). Consider a uniform random variable X with support
[-4.5, 4.5] ∪ [99.5, 100.5]. The mean of X equals

\[
\mathrm{E}(X) = \int_{x=-4.5}^{4.5} x\, f_X(x)\, dx + \int_{x=99.5}^{100.5} x\, f_X(x)\, dx \tag{4.28}
\]
\[
= \frac{1}{10} \cdot \frac{100.5^2 - 99.5^2}{2} \tag{4.29}
\]
\[
= 10. \tag{4.30}
\]

The cdf of X between -4.5 and 4.5 is equal to

\[
F_X(m) = \int_{-4.5}^{m} f_X(x)\, dx \tag{4.31}
\]
\[
= \frac{m + 4.5}{10}. \tag{4.32}
\]

Setting this equal to 1/2 allows us to compute the median, which is equal to 0.5. Figure 4.2 shows
the pdf of X and the location of the median and the mean. The median provides a more realistic
measure of the center of the distribution.

△

Random variable   Parameters   Mean        Variance
Bernoulli         p            p           p(1 - p)
Geometric         p            1/p         (1 - p)/p^2
Binomial          n, p         np          np(1 - p)
Poisson           λ            λ           λ
Uniform           a, b         (a + b)/2   (b - a)^2/12
Exponential       λ            1/λ         1/λ^2
Gaussian          μ, σ         μ           σ^2

Table 4.1: Means and variances of common random variables, derived in Section 4.5.1 of the appendix.
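
The mean and median of Example 4.2.6 can be recomputed exactly. The sketch below assumes a density that is constant on a union of disjoint intervals listed from left to right; the helper name `median` and the interval representation are choices made for this illustration.

```python
from fractions import Fraction

# Support of X: two intervals, with constant density 1 / (total length) on both.
intervals = [(Fraction(-9, 2), Fraction(9, 2)), (Fraction(199, 2), Fraction(201, 2))]
total_length = sum(b - a for a, b in intervals)
density = 1 / total_length   # equals 1/10

# Mean: integral of x * density over each interval, i.e. density * (b^2 - a^2) / 2.
mean = sum(density * (b * b - a * a) / 2 for a, b in intervals)

def median(intervals, density):
    """Point m at which the cdf reaches 1/2 (scans the intervals left to right)."""
    accumulated = Fraction(0)
    for a, b in intervals:
        mass = density * (b - a)
        if accumulated + mass >= Fraction(1, 2):
            return a + (Fraction(1, 2) - accumulated) / density
        accumulated += mass
    raise ValueError("the density does not accumulate mass 1/2")

print(mean, median(intervals, density))   # 10 1/2
```

The short interval around 100 carries only a tenth of the probability mass, yet it moves the mean from 0 to 10 while the median stays at 0.5.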


4.2.3 Variance and standard deviation 

The expected value of the square of a random variable is sometimes used to quantify the energy 
of the random variable. 

Definition 4.2.7 (Second moment). The mean square or second moment of a random variable
X is the expected value of $X^2$: $\mathrm{E}(X^2)$.

The definition generalizes to higher moments, defined as $\mathrm{E}(X^p)$ for integers p larger than two. The
mean square of the difference between the random variable and its mean is called the variance
of the random variable. It quantifies the variation of the random variable around its mean and
is also referred to as the second centered moment of the distribution. The square root of this
quantity is the standard deviation of the random variable.


Definition 4.2.8 (Variance and standard deviation). The variance of X is the mean square
deviation from the mean

\[
\mathrm{Var}(X) := \mathrm{E}\left( (X - \mathrm{E}(X))^2 \right) \tag{4.33}
\]
\[
= \mathrm{E}(X^2) - \mathrm{E}^2(X). \tag{4.34}
\]

The standard deviation $\sigma_X$ of X is

\[
\sigma_X := \sqrt{\mathrm{Var}(X)}. \tag{4.35}
\]


We have compiled the variances of some important random variables in Table 4.1. The derivations
can be found in Section 4.5.1. In Figure 4.3 we plot the pmfs and pdfs of these random
variables and display the range of values that fall within one standard deviation of the mean.

The variance operator is not linear, but it is straightforward to determine the variance of a linear 
function of a random variable. 















[Figure 4.3 panels. Top row, pmfs $p_X(k)$: Geometric (p = 0.2), Binomial (n = 20, p = 0.5), Poisson (λ = 25).
Bottom row, pdfs $f_X(x)$: Uniform [0,1], Exponential (λ = 1), Gaussian (μ = 0, σ = 1).]

Figure 4.3: Pmfs of discrete random variables (top row) and pdfs of continuous random variables
(bottom row). The mean of the random variable is marked in red. Values that are within one standard
deviation of the mean are marked in pink.


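Lemma 4.2.9 (Variance of linear functions). For any constants a and b

\[
\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X). \tag{4.36}
\]

Proof.

\[
\mathrm{Var}(aX + b) = \mathrm{E}\left( (aX + b - \mathrm{E}(aX + b))^2 \right) \tag{4.37}
\]
\[
= \mathrm{E}\left( (aX + b - a\,\mathrm{E}(X) - b)^2 \right) \tag{4.38}
\]
\[
= a^2\, \mathrm{E}\left( (X - \mathrm{E}(X))^2 \right) \tag{4.39}
\]
\[
= a^2\, \mathrm{Var}(X). \tag{4.40}
\]

□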
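
Lemma 4.2.9 and the two expressions for the variance in Definition 4.2.8 can be verified exactly on a small pmf. In this sketch the fair die and the constants a = -3, b = 7 are illustrative assumptions, and `affine` is a hypothetical helper.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}   # fair die, an illustrative choice

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Mean square deviation from the mean, as in (4.33)."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def affine(pmf, a, b):
    """pmf of a X + b (assumes a != 0, so distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

a, b = -3, 7
var_x = variance(pmf)
var_ax_b = variance(affine(pmf, a, b))
second_moment = sum(x * x * p for x, p in pmf.items())
print(var_x, var_ax_b)   # 35/12 105/4
```

The shift b has no effect on the result and the scale a enters squared, so a negative a yields the same variance as its absolute value.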


This result makes sense: If we change the center of the random variable by adding a constant,
then the variance is not affected because the variance only measures the deviation from the
mean. If we multiply a random variable by a constant, the standard deviation is scaled by the
absolute value of that constant.

4.2.4 Bounding probabilities using the mean and variance

In this section we introduce two inequalities that allow us to characterize the behavior of a random
variable to some extent just from knowing its mean and variance. The first is the Markov
inequality, which quantifies the intuitive idea that if a random variable is nonnegative and its mean
is small then the probability that it takes large values must be small.

Theorem 4.2.10 (Markov’s inequality). Let X be a nonnegative random variable. For any
positive constant a > 0,

\[
\mathrm{P}(X \geq a) \leq \frac{\mathrm{E}(X)}{a}. \tag{4.41}
\]

Proof. Consider the indicator variable $1_{X \geq a}$. We have

\[
X - a\, 1_{X \geq a} \geq 0. \tag{4.42}
\]

In particular its expectation is nonnegative (as it is the sum or integral of a nonnegative quantity
over the positive real line). By linearity of expectation and the fact that $1_{X \geq a}$ is a Bernoulli
random variable with expectation $\mathrm{P}(X \geq a)$ we have

\[
\mathrm{E}(X) \geq a\, \mathrm{E}(1_{X \geq a}) = a\, \mathrm{P}(X \geq a). \tag{4.43}
\]

□


Example 4.2.11 (Age of students). You hear that the mean age of NYU students is 20 years,
but you know quite a few students that are older than 30. You decide to apply Markov’s
inequality to bound the fraction of students above 30 by modeling age as a nonnegative random
variable A.

\[
\mathrm{P}(A \geq 30) \leq \frac{\mathrm{E}(A)}{30} = \frac{2}{3}. \tag{4.44}
\]

At most two thirds of the students are over 30.

△

As illustrated by Example 4.2.11, Markov’s inequality can be rather loose. The reason is that it
barely uses any information about the distribution of the random variable.

Chebyshev’s inequality controls the deviation of the random variable from its mean. Intuitively, 
if the variance (and hence the standard deviation) is small, then the probability that the random 
variable is far from its mean must be low. 

Theorem 4.2.12 (Chebyshev’s inequality). For any positive constant a > 0 and any random
variable X with bounded variance,

\[
\mathrm{P}(|X - \mathrm{E}(X)| \geq a) \leq \frac{\mathrm{Var}(X)}{a^2}. \tag{4.45}
\]

Proof. Applying Markov’s inequality to the random variable $Y = (X - \mathrm{E}(X))^2$ yields the result,
since the event $|X - \mathrm{E}(X)| \geq a$ is the same as $Y \geq a^2$ and $\mathrm{E}(Y) = \mathrm{Var}(X)$.

□

An interesting corollary to Chebyshev’s inequality shows that if the variance of a random variable
is zero, then the random variable is a constant or, to be precise, the probability that it deviates
from its mean is zero.

Corollary 4.2.13. If Var(X) = 0 then $\mathrm{P}(X \neq \mathrm{E}(X)) = 0$.

Proof. Take any $\epsilon > 0$, by Chebyshev’s inequality

\[
\mathrm{P}(|X - \mathrm{E}(X)| \geq \epsilon) \leq \frac{\mathrm{Var}(X)}{\epsilon^2} = 0. \tag{4.46}
\]

□


Example 4.2.14 (Age of students (continued)). You are not very satisfied with your bound on
the number of students above 30. You find out that the standard deviation of student age is
actually just 3 years. Applying Chebyshev’s inequality, this implies that

\[
\mathrm{P}(A \geq 30) \leq \mathrm{P}(|A - \mathrm{E}(A)| \geq 10) \tag{4.47}
\]
\[
\leq \frac{\mathrm{Var}(A)}{100} = \frac{9}{100}. \tag{4.48}
\]

So actually at least 91% of the students are under 30 (and above 10).

△
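
The two bounds from Examples 4.2.11 and 4.2.14 are simple enough to evaluate directly. The helper functions below are hypothetical names introduced for this sketch; they clip the bound at one because a probability can never exceed one.

```python
from fractions import Fraction

def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a nonnegative X with the given mean, a > 0."""
    return min(Fraction(1), Fraction(mean) / a)

def chebyshev_bound(variance, a):
    """Upper bound on P(|X - E(X)| >= a) for a > 0."""
    return min(Fraction(1), Fraction(variance) / (a * a))

mean_age, std_age, threshold = 20, 3, 30
markov = markov_bound(mean_age, threshold)
chebyshev = chebyshev_bound(std_age ** 2, threshold - mean_age)
print(markov, chebyshev)   # 2/3 9/100
```

Knowing the standard deviation in addition to the mean tightens the bound on the fraction of students above 30 from 2/3 to 9/100.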


4.3 Covariance 

4.3.1 Covariance of two random variables 

The covariance of two random variables describes their joint behavior. It is the expected value
of the product between the difference of the random variables and their respective means.
Intuitively, it measures to what extent the random variables fluctuate together.

Definition 4.3.1 (Covariance). The covariance of X and Y is

\[
\mathrm{Cov}(X,Y) := \mathrm{E}\left( (X - \mathrm{E}(X))\, (Y - \mathrm{E}(Y)) \right) \tag{4.49}
\]
\[
= \mathrm{E}(XY) - \mathrm{E}(X)\, \mathrm{E}(Y). \tag{4.50}
\]

If Cov(X,Y) = 0, X and Y are uncorrelated.

Figure 4.4 shows samples from bivariate Gaussian distributions with different covariances. If 
the covariance is zero, then the joint pdf has a spherical form. If the covariance is positive and 
large, then the joint pdf becomes skewed so that the two variables tend to have similar values. 
If the covariance is large and negative, then the two variables will tend to have similar values 
with opposite sign. 

The variance of the sum of two random variables can be expressed in terms of their individual
variances and their covariance. As a result, their fluctuations reinforce each other if the
covariance is positive and cancel each other if it is negative.

Theorem 4.3.2 (Variance of the sum of two random variables).

\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\, \mathrm{Cov}(X,Y). \tag{4.51}
\]

Proof.

\[
\mathrm{Var}(X + Y) = \mathrm{E}\left( (X + Y - \mathrm{E}(X + Y))^2 \right) \tag{4.52}
\]
\[
= \mathrm{E}\left( (X - \mathrm{E}(X))^2 \right) + \mathrm{E}\left( (Y - \mathrm{E}(Y))^2 \right) + 2\, \mathrm{E}\left( (X - \mathrm{E}(X))\, (Y - \mathrm{E}(Y)) \right)
\]
\[
= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\, \mathrm{Cov}(X,Y). \tag{4.53}
\]

□

[Figure 4.4 panels. Top row: Cov(X,Y) = 0.5, 0.9, 0.99. Bottom row: Cov(X,Y) = 0, -0.9, -0.99.]

Figure 4.4: Samples from 2D Gaussian vectors (X, Y), where X and Y are standard Gaussian random
variables with zero mean and unit variance, for different values of the covariance between X and Y.

An immediate consequence is that if two random variables are uncorrelated, then the variance
of their sum equals the sum of their variances.

Corollary 4.3.3. If X and Y are uncorrelated, then

\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y). \tag{4.54}
\]

The following lemma and example show that independence implies uncorrelation, but uncorrelation
does not always imply independence.

Lemma 4.3.4 (Independence implies uncorrelation). If two random variables are independent,
then they are uncorrelated.

Proof. By Theorem 4.1.7, if X and Y are independent

\[
\mathrm{Cov}(X,Y) = \mathrm{E}(XY) - \mathrm{E}(X)\, \mathrm{E}(Y) = \mathrm{E}(X)\, \mathrm{E}(Y) - \mathrm{E}(X)\, \mathrm{E}(Y) = 0. \tag{4.55}
\]

□

Example 4.3.5 (Uncorrelation does not imply independence). Let X and Y be two independent
Bernoulli random variables with parameter 1/2. Consider the random variables

\[
U = X + Y, \tag{4.56}
\]
\[
V = X - Y. \tag{4.57}
\]

Note that

\[
p_U(0) = \mathrm{P}(X = 0, Y = 0) = \frac{1}{4}, \tag{4.58}
\]
\[
p_V(0) = \mathrm{P}(X = 1, Y = 1) + \mathrm{P}(X = 0, Y = 0) = \frac{1}{2}, \tag{4.59}
\]
\[
p_{U,V}(0,0) = \mathrm{P}(X = 0, Y = 0) = \frac{1}{4} \neq p_U(0)\, p_V(0) = \frac{1}{8}, \tag{4.60}
\]

so U and V are not independent. However, they are uncorrelated as

\[
\mathrm{Cov}(U,V) = \mathrm{E}(UV) - \mathrm{E}(U)\, \mathrm{E}(V) \tag{4.61}
\]
\[
= \mathrm{E}\left( (X + Y)(X - Y) \right) - \mathrm{E}(X + Y)\, \mathrm{E}(X - Y) \tag{4.62}
\]
\[
= \mathrm{E}(X^2) - \mathrm{E}(Y^2) - \mathrm{E}^2(X) + \mathrm{E}^2(Y) = 0. \tag{4.63}
\]

The final equality holds because X and Y have the same distribution.

△
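
Example 4.3.5 involves only four equally likely outcomes, so both claims can be confirmed by enumeration. The sketch below builds the joint pmf of (U, V) explicitly; the helper `marginal` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of (U, V) = (X + Y, X - Y) for independent Bernoulli(1/2) X and Y.
joint = {}
for x, y in product((0, 1), repeat=2):
    key = (x + y, x - y)
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 4)

def marginal(joint, index):
    """Marginal pmf of the component at the given index (0 for U, 1 for V)."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], Fraction(0)) + p
    return out

p_u, p_v = marginal(joint, 0), marginal(joint, 1)
e_u = sum(u * p for u, p in p_u.items())
e_v = sum(v * p for v, p in p_v.items())
e_uv = sum(u * v * p for (u, v), p in joint.items())
cov = e_uv - e_u * e_v

# Independence would require the joint pmf to factor at every pair (u, v).
independent = all(
    joint.get((u, v), Fraction(0)) == p_u[u] * p_v[v] for u in p_u for v in p_v
)
print(cov, independent)   # 0 False
```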


4.3.2 Correlation coefficient 

The covariance does not take into account the magnitude of the variances of the random variables 
involved. The Pearson correlation coefficient is obtained by normalizing the covariance using 
the standard deviations of both variables. 

Definition 4.3.6 (Pearson correlation coefficient). The Pearson correlation coefficient of two
random variables X and Y is

\[
\rho_{X,Y} := \frac{\mathrm{Cov}(X,Y)}{\sigma_X\, \sigma_Y}. \tag{4.64}
\]

The correlation coefficient between X and Y is equal to the covariance between $X/\sigma_X$ and
$Y/\sigma_Y$. Figure 4.5 compares samples of bivariate Gaussian random variables that have the same
correlation coefficient, but different covariance and vice versa.

[Figure 4.5 panels: $\sigma_Y = 1$, Cov(X,Y) = 0.9, $\rho_{X,Y} = 0.9$; $\sigma_Y = 3$, Cov(X,Y) = 0.9, $\rho_{X,Y} = 0.3$;
$\sigma_Y = 3$, Cov(X,Y) = 2.7, $\rho_{X,Y} = 0.9$.]

Figure 4.5: Samples from 2D Gaussian vectors (X, Y), where X is a standard Gaussian random variable
with zero mean and unit variance, for different values of the standard deviation $\sigma_Y$ of Y (which is mean
zero) and of the covariance between X and Y.

Although it might not be immediately obvious, the magnitude of the correlation coefficient is
bounded by one because the covariance of two random variables cannot exceed the product
of their standard deviations. A useful interpretation of the correlation coefficient is that it
quantifies to what extent X and Y are linearly related. In fact, if it is equal to 1 or -1 then one
of the variables is a linear function of the other! All of this follows from the Cauchy-Schwarz
inequality. The proof is in Section 4.5.3.

Theorem 4.3.7 (Cauchy-Schwarz inequality). For any random variables X and Y defined on
the same probability space

\[
|\mathrm{E}(XY)| \leq \sqrt{\mathrm{E}(X^2)\, \mathrm{E}(Y^2)}. \tag{4.65}
\]

Assume $\mathrm{E}(X^2) \neq 0$,

\[
\mathrm{E}(XY) = \sqrt{\mathrm{E}(X^2)\, \mathrm{E}(Y^2)} \iff Y = \sqrt{\frac{\mathrm{E}(Y^2)}{\mathrm{E}(X^2)}}\, X, \tag{4.66}
\]
\[
\mathrm{E}(XY) = -\sqrt{\mathrm{E}(X^2)\, \mathrm{E}(Y^2)} \iff Y = -\sqrt{\frac{\mathrm{E}(Y^2)}{\mathrm{E}(X^2)}}\, X. \tag{4.67}
\]

Corollary 4.3.8. For any random variables X and Y,

\[
\mathrm{Cov}(X,Y) \leq \sigma_X\, \sigma_Y. \tag{4.68}
\]

Equivalently, the Pearson correlation coefficient satisfies

\[
|\rho_{X,Y}| \leq 1, \tag{4.69}
\]

with equality if and only if there is a linear relationship between X and Y

\[
|\rho_{X,Y}| = 1 \iff Y = cX + d, \tag{4.70}
\]

where

\[
c := \begin{cases} \dfrac{\sigma_Y}{\sigma_X} & \text{if } \rho_{X,Y} = 1, \\ -\dfrac{\sigma_Y}{\sigma_X} & \text{if } \rho_{X,Y} = -1, \end{cases} \qquad d := \mathrm{E}(Y) - c\, \mathrm{E}(X). \tag{4.71}
\]

Proof. Let

\[
U := X - \mathrm{E}(X), \tag{4.72}
\]
\[
V := Y - \mathrm{E}(Y). \tag{4.73}
\]

From the definition of the variance and the correlation coefficient,

\[
\mathrm{E}(U^2) = \mathrm{Var}(X), \tag{4.74}
\]
\[
\mathrm{E}(V^2) = \mathrm{Var}(Y), \tag{4.75}
\]
\[
\rho_{X,Y} = \frac{\mathrm{E}(UV)}{\sqrt{\mathrm{E}(U^2)\, \mathrm{E}(V^2)}}. \tag{4.76}
\]

The result now follows from applying Theorem 4.3.7 to U and V.

□

4.3.3 Covariance matrix of a random vector

The covariance matrix of a random vector captures the interaction between the components
of the vector. It contains the variance of each component in the diagonal and the covariances
between different components in the off diagonals.

Definition 4.3.9. The covariance matrix of a random vector $\vec{X}$ is defined as

\[
\Sigma_{\vec{X}} := \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1,X_2) & \cdots & \mathrm{Cov}(X_1,X_n) \\ \mathrm{Cov}(X_2,X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2,X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,X_1) & \mathrm{Cov}(X_n,X_2) & \cdots & \mathrm{Var}(X_n) \end{bmatrix} \tag{4.77}
\]
\[
= \mathrm{E}\left( \vec{X} \vec{X}^T \right) - \mathrm{E}(\vec{X})\, \mathrm{E}(\vec{X})^T. \tag{4.78}
\]


Note that if all the entries of a vector are uncorrelated, then its covariance matrix is diagonal. 

From Theorem 4.2.4 we obtain a simple expression for the covariance matrix of the linear 
transformation of a random vector. 

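Theorem 4.3.10 (Covariance matrix after a linear transformation). Let $\vec{X}$ be a random vector
of dimension n with covariance matrix $\Sigma_{\vec{X}}$. For any matrix $A \in \mathbb{R}^{m \times n}$ and $\vec{b} \in \mathbb{R}^m$,

\[
\Sigma_{A\vec{X} + \vec{b}} = A\, \Sigma_{\vec{X}}\, A^T. \tag{4.79}
\]

Proof.

\[
\Sigma_{A\vec{X} + \vec{b}} = \mathrm{E}\left( (A\vec{X} + \vec{b})\, (A\vec{X} + \vec{b})^T \right) - \mathrm{E}(A\vec{X} + \vec{b})\, \mathrm{E}(A\vec{X} + \vec{b})^T \tag{4.80}
\]
\[
= A\, \mathrm{E}\left( \vec{X} \vec{X}^T \right) A^T + \vec{b}\, \mathrm{E}(\vec{X})^T A^T + A\, \mathrm{E}(\vec{X})\, \vec{b}^T + \vec{b} \vec{b}^T
- A\, \mathrm{E}(\vec{X})\, \mathrm{E}(\vec{X})^T A^T - A\, \mathrm{E}(\vec{X})\, \vec{b}^T - \vec{b}\, \mathrm{E}(\vec{X})^T A^T - \vec{b} \vec{b}^T \tag{4.81}
\]
\[
= A \left( \mathrm{E}\left( \vec{X} \vec{X}^T \right) - \mathrm{E}(\vec{X})\, \mathrm{E}(\vec{X})^T \right) A^T \tag{4.82}
\]
\[
= A\, \Sigma_{\vec{X}}\, A^T. \tag{4.83}
\]

□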

An immediate corollary of this result is that we can easily decode the variance of the random 
vector in any direction from the covariance matrix. Mathematically, the variance of the random 
vector in the direction of a unit vector v is equal to the variance of its projection onto v. 

Corollary 4.3.11. Let $\vec{v}$ be a unit vector,

\[
\mathrm{Var}\left( \vec{v}^T \vec{X} \right) = \vec{v}^T\, \Sigma_{\vec{X}}\, \vec{v}. \tag{4.84}
\]
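
Theorem 4.3.10 and Corollary 4.3.11 can be checked with exact arithmetic on a small discrete random vector. Everything specific in this sketch (the three outcomes, the matrix A, the vector b and the direction v) is a hypothetical choice made for illustration, and the matrix helpers are written out by hand to keep the block self-contained.

```python
from fractions import Fraction

# Hypothetical 2-dimensional discrete random vector: list of (outcome, probability).
outcomes = [((0, 0), Fraction(1, 2)), ((1, 2), Fraction(1, 4)), ((3, 1), Fraction(1, 4))]

def covariance_matrix(dist):
    """E(X X^T) - E(X) E(X)^T for a finite distribution, as in (4.78)."""
    n = len(dist[0][0])
    mean = [sum(p * x[i] for x, p in dist) for i in range(n)]
    return [[sum(p * x[i] * x[j] for x, p in dist) - mean[i] * mean[j]
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

A = [[1, 2], [0, -1], [3, 1]]   # maps R^2 to R^3
b = [5, -2, 0]

# Distribution of A X + b: transform each outcome and keep its probability.
transformed = [
    (tuple(sum(A[i][k] * x[k] for k in range(2)) + b[i] for i in range(3)), p)
    for x, p in outcomes
]

sigma_x = covariance_matrix(outcomes)
lhs = covariance_matrix(transformed)
rhs = matmul(matmul(A, sigma_x), transpose(A))

# Corollary 4.3.11 for the direction of v = (3, 4): divide by ||v||^2 = 25
# so that the result corresponds to the unit vector v / 5.
v = [[3], [4]]
directional_variance = matmul(matmul(transpose(v), sigma_x), v)[0][0] / 25
print(lhs == rhs, directional_variance)   # True 73/50
```

The shift b does not appear in the covariance of the transformed vector, in line with (4.79).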
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Consider the eigendecomposition of the covariance matrix of an n-dimensional random vector 
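$\vec{X}$

\[
\Sigma_{\vec{X}} = U \Lambda U^T \tag{4.85}
\]
\[
= \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}^T, \tag{4.86}
\]

where the eigenvalues are ordered $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$. Covariance matrices are symmetric
by definition, so by Theorem B.7.1 the eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ can be chosen to be
orthogonal. These eigenvectors and the eigenvalues completely characterize the variance of the
random vector in different directions. The following theorem is a direct consequence of Corollary 4.3.11
and Theorem B.7.2.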


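Theorem 4.3.12. Let $\vec{X}$ be a random vector of dimension n with covariance matrix $\Sigma_{\vec{X}}$. The
eigendecomposition of $\Sigma_{\vec{X}}$ given by (4.86) satisfies

\[
\lambda_1 = \max_{\|\vec{v}\|_2 = 1} \mathrm{Var}\left( \vec{v}^T \vec{X} \right), \tag{4.87}
\]
\[
\vec{u}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \mathrm{Var}\left( \vec{v}^T \vec{X} \right), \tag{4.88}
\]
\[
\lambda_k = \max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \mathrm{Var}\left( \vec{v}^T \vec{X} \right), \tag{4.89}
\]
\[
\vec{u}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \mathrm{Var}\left( \vec{v}^T \vec{X} \right). \tag{4.90}
\]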


In words, $\vec{u}_1$ is the direction of maximum variance. The eigenvector $\vec{u}_2$ corresponding to the
second largest eigenvalue $\lambda_2$ is the direction of maximum variation that is orthogonal to $\vec{u}_1$. In
general, the eigenvector $\vec{u}_k$ corresponding to the kth largest eigenvalue $\lambda_k$ reveals the direction
of maximum variation that is orthogonal to $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_{k-1}$. Finally, $\vec{u}_n$ is the direction of
minimum variance. Figure 4.6 illustrates this with an example, where n = 2. As we discuss
in Chapter 8, principal component analysis, a popular method for unsupervised learning and
dimensionality reduction, applies the same principle to determine the directions of variation of
a data set.

To conclude the section, we describe an algorithm to transform samples from an uncorrelated
random vector so that they have a prescribed covariance matrix. The process of transforming
uncorrelated samples for this purpose is called coloring because uncorrelated samples are usually
described as being white noise. As we will see in the next section, coloring allows us to simulate
Gaussian random vectors.

Algorithm 4.3.13 (Coloring uncorrelated samples). Let $\vec{x}$ be a realization from an n-dimensional
random vector with covariance matrix I. To generate samples with covariance matrix $\Sigma$, we:

1. Compute the eigendecomposition $\Sigma = U \Lambda U^T$.

2. Set $\vec{y} := U \sqrt{\Lambda}\, \vec{x}$, where $\sqrt{\Lambda}$ is a diagonal matrix containing the square roots of the
eigenvalues of $\Sigma$,

\[
\sqrt{\Lambda} := \begin{bmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_n} \end{bmatrix}. \tag{4.91}
\]

By Theorem 4.3.10 the covariance matrix of $\vec{Y} := U \sqrt{\Lambda}\, \vec{X}$ indeed equals $\Sigma$.

\[
\Sigma_{\vec{Y}} = U \sqrt{\Lambda}\, \Sigma_{\vec{X}}\, \sqrt{\Lambda}^T U^T \tag{4.92}
\]
\[
= U \sqrt{\Lambda}\, I\, \sqrt{\Lambda}^T U^T \tag{4.93}
\]
\[
= \Sigma. \tag{4.94}
\]

[Figure 4.6 panels: $\sqrt{\lambda_1} = 1.22$, $\sqrt{\lambda_2} = 0.71$; $\sqrt{\lambda_1} = 1$, $\sqrt{\lambda_2} = 1$; $\sqrt{\lambda_1} = 1.38$, $\sqrt{\lambda_2} = 0.32$.]

Figure 4.6: Samples from bivariate Gaussian random vectors with different covariance matrices are
shown in gray. The eigenvectors of the covariance matrices are plotted in red. Each is scaled by the
square root of the corresponding eigenvalue $\lambda_1$ or $\lambda_2$.

Figure 4.7 illustrates the two steps of coloring in 2D: First the samples are stretched according to
the eigenvalues of $\Sigma$ and then they are rotated to align them with the corresponding eigenvectors.
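
The coloring steps above can be sketched for the two-dimensional case, where the eigendecomposition of a symmetric matrix has a closed form. This is an illustration under stated assumptions, not the general algorithm: the helper names are hypothetical, the target matrix is an arbitrary example, and a general implementation would call a numerical eigensolver instead.

```python
import math

def eig_sym_2x2(s):
    """Eigendecomposition of a symmetric 2x2 matrix [[a, b], [b, d]].

    Returns (eigenvalues, U) with the eigenvalues in decreasing order and the
    corresponding unit eigenvectors as the columns of U.
    """
    a, b, d = s[0][0], s[0][1], s[1][1]
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = (a + d) / 2.0 + radius, (a + d) / 2.0 - radius
    if b == 0.0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        u1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        norm = math.hypot(b, lam1 - a)
        u1 = (b / norm, (lam1 - a) / norm)
    u2 = (-u1[1], u1[0])   # orthogonal to u1
    return (lam1, lam2), [[u1[0], u2[0]], [u1[1], u2[1]]]

def coloring_matrix(sigma):
    """U sqrt(Lambda) for a symmetric positive semidefinite 2x2 matrix."""
    (lam1, lam2), u = eig_sym_2x2(sigma)
    r1, r2 = math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))
    return [[u[0][0] * r1, u[0][1] * r2], [u[1][0] * r1, u[1][1] * r2]]

def color(sigma, x):
    """Step 2 of the algorithm: y := U sqrt(Lambda) x."""
    m = coloring_matrix(sigma)
    return (m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1])

sigma = [[2.0, 1.0], [1.0, 2.0]]   # target covariance matrix (illustrative values)
m = coloring_matrix(sigma)
# With identity input covariance, the output covariance is M I M^T = M M^T.
reconstructed = [[sum(m[i][k] * m[j][k] for k in range(2)) for j in range(2)]
                 for i in range(2)]
print(reconstructed)   # should match sigma up to rounding error
```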


4.3.4 Gaussian random vectors 

We have mostly used Gaussian vectors to visualize the different properties of the covariance
operator. As opposed to other random vectors, Gaussian random vectors are completely
determined by their mean and their covariance matrix. An important consequence is that if the
entries of a Gaussian random vector are uncorrelated then they are also mutually independent.

Lemma 4.3.14 (Uncorrelation implies mutual independence for Gaussian random vectors). If
all the components of a Gaussian random vector $\vec{X}$ are uncorrelated, then they are
mutually independent.

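Proof. The parameter $\Sigma$ of the joint pdf of a Gaussian random vector is its covariance matrix (one
can verify this by applying the definition of covariance and integrating). If all the components
are uncorrelated then

\[
\Sigma_{\vec{X}} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}, \tag{4.95}
\]

where $\sigma_i$ is the standard deviation of the ith component. Now, the inverse of this diagonal
matrix is just

\[
\Sigma_{\vec{X}}^{-1} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_n^2} \end{bmatrix}, \tag{4.96}
\]

and its determinant is $|\Sigma| = \prod_{i=1}^{n} \sigma_i^2$ so that

\[
f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\frac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu}) \right) \tag{4.97}
\]
\[
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left( -\frac{(x_i - \mu_i)^2}{2 \sigma_i^2} \right) \tag{4.98}
\]
\[
= \prod_{i=1}^{n} f_{X_i}(x_i). \tag{4.99}
\]

Since the joint pdf factors into a product of the marginals, the components are all mutually
independent. □

Figure 4.7: When we color two-dimensional uncorrelated samples (left), first the diagonal matrix $\sqrt{\Lambda}$
stretches them differently along different directions according to the eigenvalues of the desired covariance
matrix (center) and then U rotates them so that they are aligned with the corresponding eigenvectors
(right).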


The following algorithm generates samples from a Gaussian random vector with an arbitrary
mean and covariance matrix by coloring (and centering) a vector of independent samples from
a standard Gaussian distribution.

Algorithm 4.3.15 (Generating a Gaussian random vector). To sample from an n-dimensional
Gaussian random vector with mean $\vec{\mu}$ and covariance matrix $\Sigma$, we:

1. Generate a vector $\vec{x}$ containing n independent standard Gaussian samples.

2. Compute the eigendecomposition $\Sigma = U \Lambda U^T$.

3. Set $\vec{y} := U \sqrt{\Lambda}\, \vec{x} + \vec{\mu}$, where $\sqrt{\Lambda}$ is defined by (8.20).

The algorithm just centers and colors the random vector $\vec{Y} := U \sqrt{\Lambda}\, \vec{X} + \vec{\mu}$. By linearity of
expectation its mean is

\[
\mathrm{E}(\vec{Y}) = U \sqrt{\Lambda}\, \mathrm{E}(\vec{X}) + \vec{\mu} \tag{4.100}
\]
\[
= \vec{\mu} \tag{4.101}
\]

since the mean of $\vec{X}$ is zero. The same argument used in equation (4.94) shows that the covariance
matrix of $\vec{Y}$ is $\Sigma$. Since coloring and centering are linear operations, by Theorem 3.2.14
$\vec{Y}$ is Gaussian with the desired mean and covariance matrix. For example, in Figure 4.7 the
generated samples are Gaussian. For non-Gaussian random vectors, coloring will modify the
covariance matrix, but not necessarily preserve the distribution.
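
A minimal sketch of Algorithm 4.3.15 in two dimensions follows. It relies on a closed-form 2x2 eigendecomposition so that only the standard library is needed; the function names, the seed and the numerical values of the mean and covariance matrix are assumptions made for the example.

```python
import math
import random

def coloring_matrix_2x2(sigma):
    """U sqrt(Lambda) for a symmetric positive semidefinite 2x2 matrix (closed form)."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = (a + d) / 2.0 + radius, (a + d) / 2.0 - radius
    if b == 0.0:
        u1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        norm = math.hypot(b, lam1 - a)
        u1 = (b / norm, (lam1 - a) / norm)
    u2 = (-u1[1], u1[0])
    r1, r2 = math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))
    return [[u1[0] * r1, u2[0] * r2], [u1[1] * r1, u2[1] * r2]]

def sample_gaussian_vector(mu, m, rng):
    """One sample y := U sqrt(Lambda) x + mu, with m the precomputed coloring matrix."""
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))   # step 1: independent standard Gaussians
    return (m[0][0] * x[0] + m[0][1] * x[1] + mu[0],   # step 3: color and center
            m[1][0] * x[0] + m[1][1] * x[1] + mu[1])

rng = random.Random(1)   # fixed seed for reproducibility
mu, sigma = (1.0, -2.0), [[2.0, 1.2], [1.2, 1.0]]
m = coloring_matrix_2x2(sigma)   # step 2: eigendecomposition, done once
ys = [sample_gaussian_vector(mu, m, rng) for _ in range(100_000)]

n = len(ys)
mean = [sum(y[i] for y in ys) / n for i in range(2)]
cov = [[sum((y[i] - mean[i]) * (y[j] - mean[j]) for y in ys) / n for j in range(2)]
       for i in range(2)]
print(mean, cov)   # empirical mean and covariance, close to mu and sigma
```

Computing the coloring matrix once and reusing it for every sample avoids repeating the eigendecomposition, which is the expensive step in higher dimensions.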

4.4 Conditional expectation 

Conditional expectation is a useful tool for manipulating random variables. Unfortunately, it can 
be somewhat confusing (as we see below it’s a random variable not an expectation!). Consider 
a function g of two random variables X and Y. The expectation of g conditioned on the event 
X = x for any fixed value x can be computed using the conditional pmf or pdf of Y given X. 

\[
\mathrm{E}(g(X,Y) \mid X = x) = \sum_{y \in R} g(x,y)\, p_{Y|X}(y|x), \tag{4.102}
\]

if Y is discrete and has range R, whereas

\[
\mathrm{E}(g(X,Y) \mid X = x) = \int_{y=-\infty}^{\infty} g(x,y)\, f_{Y|X}(y|x)\, dy, \tag{4.103}
\]

if Y is continuous.

Note that $\mathrm{E}(g(X,Y) \mid X = x)$ can actually be interpreted as a function of x since it maps every
value of x to a real number. This allows us to define the conditional expectation of $g(X,Y)$ given
X as follows.

Definition 4.4.1 (Conditional expectation). The conditional expectation of $g(X,Y)$ given X
is

\[
\mathrm{E}(g(X,Y) \mid X) := h(X), \tag{4.104}
\]

where

\[
h(x) := \mathrm{E}(g(X,Y) \mid X = x). \tag{4.105}
\]

Beware the confusing definition: the conditional expectation is actually a random variable!

One of the main uses of conditional expectation is applying iterated expectation for computing 
expected values. The idea is that the expected value of a certain quantity can be expressed as 
the expectation of the conditional expectation of the quantity. 

Theorem 4.4.2 (Iterated expectation). For any random variables X and Y and any function
$g : \mathbb{R}^2 \to \mathbb{R}$

\[
\mathrm{E}(g(X,Y)) = \mathrm{E}\left( \mathrm{E}(g(X,Y) \mid X) \right). \tag{4.106}
\]

Proof. We prove the result for continuous random variables; the proof for discrete random
variables, and for quantities that depend on both continuous and discrete random variables, is
almost identical. To make the explanation clearer, we define

\[
h(x) := \mathrm{E}(g(X,Y) \mid X = x) \tag{4.107}
\]
\[
= \int_{y=-\infty}^{\infty} g(x,y)\, f_{Y|X}(y|x)\, dy. \tag{4.108}
\]

Now,

\[
\mathrm{E}\left( \mathrm{E}(g(X,Y) \mid X) \right) = \mathrm{E}(h(X)) \tag{4.109}
\]
\[
= \int_{x=-\infty}^{\infty} h(x)\, f_X(x)\, dx \tag{4.110}
\]
\[
= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} f_X(x)\, f_{Y|X}(y|x)\, g(x,y)\, dy\, dx \tag{4.111}
\]
\[
= \mathrm{E}(g(X,Y)). \tag{4.112}
\]

□


Iterated expectation allows us to obtain the expectation of quantities that depend on several
quantities very easily if we have access to the marginal and conditional distributions. We illustrate
this with several examples taken from the previous chapters.

Example 4.4.3 (Desert (continued from Example 3.4.10)). Let us compute the mean time at
which the car breaks down, i.e. the mean of T. By iterated expectation

\[
\mathrm{E}(T) = \mathrm{E}\left( \mathrm{E}(T \mid M, R) \right) \tag{4.113}
\]
\[
= \mathrm{E}\left( \frac{1}{M + R} \right) \quad \text{because T is exponential when conditioned on M and R} \tag{4.114}
\]
\[
= \int_{0}^{1} \int_{0}^{1} \frac{1}{m + r}\, dm\, dr \tag{4.115}
\]
\[
= \int_{0}^{1} \log(r + 1) - \log(r)\, dr \tag{4.116}
\]
\[
= \log 4 \approx 1.39 \quad \text{integrating by parts.} \tag{4.117}
\]

△
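
The double integral in (4.115) can be approximated numerically as a sanity check on the value log 4. The sketch below uses a plain midpoint rule on a uniform grid; the grid size and the function name are arbitrary choices, and the integrand assumes, as (4.115) indicates, that M and R are integrated uniformly over [0,1].

```python
import math

def expected_breakdown_time(n=400):
    """Midpoint-rule approximation of the double integral of 1 / (m + r) over [0, 1]^2."""
    h = 1.0 / n
    mids = [(i + 0.5) * h for i in range(n)]
    return sum(1.0 / (m + r) for m in mids for r in mids) * h * h

approx = expected_breakdown_time()
print(approx, math.log(4.0))   # the two values should agree to about two decimals
```

The integrand is unbounded at the origin, so the midpoint rule converges slowly there; the approximation is only meant to confirm the order of magnitude of the closed-form result.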
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Example 4.4.4 (Grizzlies in Yellowstone (continued from Example 3.3.3)). 
mean weight of a bear in Yosemite. By iterated expectation 

Let us compute the 

E {W) = E(E(JT|S)) 

(4.118) 

E {W\S = 0) + E (TE 5 = 1) 

2 

(4.119) 

= 180 kg. 

(4.120) 


A 


Example 4.4.5 (Bayesian coin flip (continued from Example 3.3.6)). Let us compute the mean of the coin-flip outcome $X$. By iterated expectation

\begin{align}
\mathrm{E}(X) &= \mathrm{E}(\mathrm{E}(X \mid B)) \tag{4.121} \\
&= \mathrm{E}(B) \quad \text{because $X$ is Bernoulli when conditioned on $B$} \tag{4.122} \\
&= \int_0^1 2 b^2 \, \mathrm{d}b \tag{4.123} \\
&= \frac{2}{3}. \tag{4.124}
\end{align}

△
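The result of Example 4.4.5 can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes that the pdf of $B$ is $f_B(b) = 2b$ on $[0,1]$, which is what the integrand in (4.123) suggests, and it mimics iterated expectation by first sampling $B$ and then sampling $X$ conditioned on $B$. The function name and sample size are illustrative choices.

```python
import random


def simulate_coin_flip_mean(num_samples, seed=0):
    """Estimate E(X) by sampling B first and then X given B."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        # Assumption: B has pdf 2b on [0, 1], so its cdf is b^2 and
        # inverse-transform sampling gives B = sqrt(U) with U uniform.
        b = rng.random() ** 0.5
        # Conditioned on B = b, X is Bernoulli with parameter b.
        x = 1 if rng.random() < b else 0
        total += x
    return total / num_samples


estimate = simulate_coin_flip_mean(200_000)
print(f"Monte Carlo estimate of E(X): {estimate:.3f} (exact value: {2 / 3:.3f})")
```

The estimate should agree with 2/3 up to Monte Carlo error, which for this sample size is on the order of a thousandth.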


4.5 Proofs 

4.5.1 Derivation of means and variances in Table 4.1 
Bernoulli 


\begin{align}
\mathrm{E}(X) &= p_X(1) = p, \tag{4.125} \\
\mathrm{E}(X^2) &= p_X(1), \tag{4.126} \\
\operatorname{Var}(X) &= \mathrm{E}(X^2) - \mathrm{E}^2(X) = p(1-p). \tag{4.127}
\end{align}


Geometric 

To compute the mean of a geometric random variable, we need to deal with a geometric series. 
By Lemma 4.5.3 in Section 4.5.2 below we have: 


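\begin{align}
\mathrm{E}(X) &= \sum_{k=1}^{\infty} k \, p_X(k) \tag{4.128} \\
&= \sum_{k=1}^{\infty} k \, p \, (1-p)^{k-1} \tag{4.129} \\
&= \frac{p}{1-p} \sum_{k=1}^{\infty} k \, (1-p)^{k} = \frac{1}{p}. \tag{4.130}
\end{align}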






To compute the mean square value we apply Lemma 4.5.4 in the same section: 


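\begin{align}
\mathrm{E}(X^2) &= \sum_{k=1}^{\infty} k^2 \, p_X(k) \tag{4.131} \\
&= \sum_{k=1}^{\infty} k^2 \, p \, (1-p)^{k-1} \tag{4.132} \\
&= \frac{p}{1-p} \sum_{k=1}^{\infty} k^2 \, (1-p)^{k} \tag{4.133} \\
&= \frac{2-p}{p^2}. \tag{4.134}
\end{align}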
Binomial 

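As shown in Example 2.2.6, we can express a binomial random variable with parameters $n$ and $p$ as the sum of $n$ independent Bernoulli random variables $B_1, B_2, \ldots, B_n$ with parameter $p$,

$$X = \sum_{i=1}^{n} B_i. \tag{4.135}$$

Since the mean of the Bernoulli random variables is $p$, by linearity of expectation

$$\mathrm{E}(X) = \sum_{i=1}^{n} \mathrm{E}(B_i) = np. \tag{4.136}$$

Note that $\mathrm{E}(B_i^2) = p$ and $\mathrm{E}(B_i B_j) = p^2$ for $i \neq j$ by independence, so

\begin{align}
\mathrm{E}(X^2) &= \mathrm{E}\left( \sum_{i=1}^{n} \sum_{j=1}^{n} B_i B_j \right) \tag{4.137} \\
&= \sum_{i=1}^{n} \mathrm{E}(B_i^2) + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{E}(B_i B_j) = np + n(n-1)p^2. \tag{4.138}
\end{align}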

Poisson 

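From calculus we have

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}, \tag{4.139}$$

which is the Taylor series expansion of the exponential function. This implies

\begin{align}
\mathrm{E}(X) &= \sum_{k=1}^{\infty} k \, p_X(k) \tag{4.140} \\
&= \sum_{k=1}^{\infty} \frac{\lambda^k e^{-\lambda}}{(k-1)!} \tag{4.141} \\
&= e^{-\lambda} \sum_{m=0}^{\infty} \frac{\lambda^{m+1}}{m!} = \lambda, \tag{4.142}
\end{align}

and

\begin{align}
\mathrm{E}(X^2) &= \sum_{k=1}^{\infty} k^2 \, p_X(k) \tag{4.143} \\
&= \sum_{k=1}^{\infty} \frac{k \lambda^k e^{-\lambda}}{(k-1)!} \tag{4.144} \\
&= e^{-\lambda} \sum_{k=1}^{\infty} \left( \frac{(k-1) \lambda^k}{(k-1)!} + \frac{\lambda^k}{(k-1)!} \right) \tag{4.145} \\
&= e^{-\lambda} \left( \sum_{m=0}^{\infty} \frac{\lambda^{m+2}}{m!} + \sum_{m=0}^{\infty} \frac{\lambda^{m+1}}{m!} \right) = \lambda^2 + \lambda. \tag{4.146}
\end{align}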


Uniform 

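We apply the definition of expected value for continuous random variables to obtain

\begin{align}
\mathrm{E}(X) &= \int_{-\infty}^{\infty} x f_X(x) \, \mathrm{d}x = \int_a^b \frac{x}{b-a} \, \mathrm{d}x \tag{4.147} \\
&= \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}. \tag{4.148}
\end{align}

Similarly,

\begin{align}
\mathrm{E}(X^2) &= \int_a^b \frac{x^2}{b-a} \, \mathrm{d}x \tag{4.149} \\
&= \frac{b^3 - a^3}{3(b-a)} \tag{4.150} \\
&= \frac{a^2 + ab + b^2}{3}. \tag{4.151}
\end{align}

Exponential

Applying integration by parts,

\begin{align}
\mathrm{E}(X) &= \int_{-\infty}^{\infty} x f_X(x) \, \mathrm{d}x \tag{4.152} \\
&= \int_0^{\infty} x \lambda e^{-\lambda x} \, \mathrm{d}x \tag{4.153} \\
&= \left. -x e^{-\lambda x} \right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \, \mathrm{d}x = \frac{1}{\lambda}. \tag{4.154}
\end{align}

Similarly,

\begin{align}
\mathrm{E}(X^2) &= \int_0^{\infty} x^2 \lambda e^{-\lambda x} \, \mathrm{d}x \tag{4.155} \\
&= \left. -x^2 e^{-\lambda x} \right]_0^{\infty} + 2 \int_0^{\infty} x e^{-\lambda x} \, \mathrm{d}x = \frac{2}{\lambda^2}. \tag{4.156}
\end{align}

Gaussian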

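We apply the change of variables $t = (x - \mu)/\sigma$.

\begin{align}
\mathrm{E}(X) &= \int_{-\infty}^{\infty} x f_X(x) \, \mathrm{d}x \tag{4.157} \\
&= \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, \mathrm{d}x \tag{4.158} \\
&= \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} t e^{-\frac{t^2}{2}} \, \mathrm{d}t + \frac{\mu}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{t^2}{2}} \, \mathrm{d}t \tag{4.159} \\
&= \mu, \tag{4.160}
\end{align}

where the last step follows from the fact that the integral of a bounded odd function over a symmetric interval is zero.

Applying the change of variables $t = (x - \mu)/\sigma$ and integrating by parts, we obtain that

\begin{align}
\mathrm{E}(X^2) &= \int_{-\infty}^{\infty} x^2 f_X(x) \, \mathrm{d}x \tag{4.161} \\
&= \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, \mathrm{d}x \tag{4.162} \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} t^2 e^{-\frac{t^2}{2}} \, \mathrm{d}t + \frac{2\mu\sigma}{\sqrt{2\pi}} \int_{-\infty}^{\infty} t e^{-\frac{t^2}{2}} \, \mathrm{d}t + \frac{\mu^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{t^2}{2}} \, \mathrm{d}t \tag{4.163} \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \left( \left. -t e^{-\frac{t^2}{2}} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-\frac{t^2}{2}} \, \mathrm{d}t \right) + \mu^2 \tag{4.164} \\
&= \sigma^2 + \mu^2. \tag{4.165}
\end{align}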


4.5.2 Geometric series 

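Lemma 4.5.1. For any $\alpha \neq 0$, $\alpha \neq 1$, and any integers $n_1 \leq n_2$,

$$\sum_{k=n_1}^{n_2} \alpha^k = \frac{\alpha^{n_1} - \alpha^{n_2+1}}{1 - \alpha}. \tag{4.166}$$

Corollary 4.5.2. If $0 < \alpha < 1$,

$$\sum_{k=0}^{\infty} \alpha^k = \frac{1}{1 - \alpha}. \tag{4.167}$$

Proof. We just multiply the sum by the factor $(1 - \alpha)/(1 - \alpha)$, which obviously equals one:

\begin{align}
\alpha^{n_1} + \alpha^{n_1+1} + \cdots + \alpha^{n_2-1} + \alpha^{n_2} &= \frac{1-\alpha}{1-\alpha} \left( \alpha^{n_1} + \alpha^{n_1+1} + \cdots + \alpha^{n_2-1} + \alpha^{n_2} \right) \tag{4.168} \\
&= \frac{\alpha^{n_1} - \alpha^{n_1+1} + \alpha^{n_1+1} - \cdots - \alpha^{n_2} + \alpha^{n_2} - \alpha^{n_2+1}}{1-\alpha} \notag \\
&= \frac{\alpha^{n_1} - \alpha^{n_2+1}}{1-\alpha}. \tag{4.169}
\end{align}

□

Lemma 4.5.3. For $0 < \alpha < 1$,

$$\sum_{k=1}^{\infty} k \alpha^k = \frac{\alpha}{(1-\alpha)^2}. \tag{4.170}$$

Proof. By Corollary 4.5.2,

$$\sum_{k=0}^{\infty} \alpha^k = \frac{1}{1-\alpha}. \tag{4.171}$$

Since the series on the left converges, we can differentiate both sides with respect to $\alpha$ to obtain

$$\sum_{k=0}^{\infty} k \alpha^{k-1} = \frac{1}{(1-\alpha)^2}. \tag{4.172}$$

Multiplying both sides by $\alpha$ yields the result. □

Lemma 4.5.4. For $0 < \alpha < 1$,

$$\sum_{k=1}^{\infty} k^2 \alpha^k = \frac{\alpha (1+\alpha)}{(1-\alpha)^3}. \tag{4.173}$$

Proof. By Lemma 4.5.3,

$$\sum_{k=1}^{\infty} k \alpha^k = \frac{\alpha}{(1-\alpha)^2}. \tag{4.174}$$

Since the series on the left converges, we can differentiate both sides with respect to $\alpha$ to obtain

$$\sum_{k=1}^{\infty} k^2 \alpha^{k-1} = \frac{1+\alpha}{(1-\alpha)^3}. \tag{4.175}$$

Multiplying both sides by $\alpha$ yields the result. □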


4.5.3 Proof of Theorem 4.3.7 

If $\mathrm{E}(X^2) = 0$ then $X = 0$ with probability one by Corollary 4.2.13, which implies $\mathrm{E}(XY) = 0$ and consequently that equality holds in (4.65). The same is true if $\mathrm{E}(Y^2) = 0$.

Now assume that $\mathrm{E}(X^2) \neq 0$ and $\mathrm{E}(Y^2) \neq 0$. Let us define the constants $a = \sqrt{\mathrm{E}(Y^2)}$ and $b = \sqrt{\mathrm{E}(X^2)}$. By linearity of expectation,

\begin{align}
\mathrm{E}\left((aX + bY)^2\right) &= a^2 \mathrm{E}(X^2) + b^2 \mathrm{E}(Y^2) + 2ab \, \mathrm{E}(XY) \tag{4.176} \\
&= 2 \left( \mathrm{E}(X^2) \mathrm{E}(Y^2) + \sqrt{\mathrm{E}(X^2) \mathrm{E}(Y^2)} \, \mathrm{E}(XY) \right), \tag{4.177} \\
\mathrm{E}\left((aX - bY)^2\right) &= a^2 \mathrm{E}(X^2) + b^2 \mathrm{E}(Y^2) - 2ab \, \mathrm{E}(XY) \tag{4.178} \\
&= 2 \left( \mathrm{E}(X^2) \mathrm{E}(Y^2) - \sqrt{\mathrm{E}(X^2) \mathrm{E}(Y^2)} \, \mathrm{E}(XY) \right). \tag{4.179}
\end{align}

The expectation of a nonnegative quantity is nonnegative because the integral or sum of a nonnegative quantity is nonnegative. Consequently, the left-hand sides of (4.176) and (4.178) are nonnegative, so (4.177) and (4.179) are both nonnegative, which implies (4.65).

Let us prove (B.21) by proving both implications.

($\Rightarrow$). Assume $\mathrm{E}(XY) = -\sqrt{\mathrm{E}(X^2) \mathrm{E}(Y^2)}$. Then (4.177) equals zero, so

$$\mathrm{E}\left( \left( \sqrt{\mathrm{E}(Y^2)} \, X + \sqrt{\mathrm{E}(X^2)} \, Y \right)^2 \right) = 0, \tag{4.180}$$

which by Corollary 4.2.13 means that $\sqrt{\mathrm{E}(Y^2)} \, X = -\sqrt{\mathrm{E}(X^2)} \, Y$ with probability one.

($\Leftarrow$). Assume $Y = -\sqrt{\frac{\mathrm{E}(Y^2)}{\mathrm{E}(X^2)}} \, X$. Then one can easily check that (4.177) equals zero, which implies $\mathrm{E}(XY) = -\sqrt{\mathrm{E}(X^2) \mathrm{E}(Y^2)}$.

The proof of (B.22) is almost identical (using (4.179) instead of (4.177)).









Chapter 5 


Random Processes 


Random processes, also known as stochastic processes, are used to model uncertain quantities 
that evolve in time: the trajectory of a particle, the price of oil, the temperature in New York, the 
national debt of the United States, etc. In these notes we introduce a mathematical framework 
that makes it possible to reason probabilistically about such quantities. 

5.1 Definition 

We denote random processes using a tilde over an upper case letter, $\tilde{X}$. This is not standard notation, but we want to emphasize the difference with random variables and random vectors. Formally, a random process $\tilde{X}$ is a function that maps elements in a sample space $\Omega$ to real-valued functions.

Definition 5.1.1 (Random process). Given a probability space $(\Omega, \mathcal{F}, \mathrm{P})$, a random process $\tilde{X}$ is a function that maps each element $\omega$ in the sample space $\Omega$ to a function $\tilde{X}(\omega, \cdot) : \mathcal{T} \to \mathbb{R}$, where $\mathcal{T}$ is a discrete or continuous set.

There are two possible interpretations for $\tilde{X}(\omega, t)$:

• If we fix $\omega$, then $\tilde{X}(\omega, t)$ is a deterministic function of $t$ known as a realization of the random process.

• If we fix $t$, then $\tilde{X}(\omega, t)$ is a random variable, which we usually just denote by $\tilde{X}(t)$.

We can consequently interpret $\tilde{X}$ as an infinite collection of random variables indexed by $t$. The set of possible values that the random variable $\tilde{X}(t)$ can take for fixed $t$ is called the state space of the random process. Random processes can be classified according to the indexing variable or to their state space.

• If the indexing variable $t$ is defined on $\mathbb{R}$, or on a semi-infinite interval $(t_0, \infty)$ for some $t_0 \in \mathbb{R}$, then $\tilde{X}$ is a continuous-time random process.

• If the indexing variable $t$ is defined on a discrete set, usually the integers or the natural numbers, then $\tilde{X}$ is a discrete-time random process. In such cases we often use a different letter from $t$, such as $i$, as an indexing variable.

• If $\tilde{X}(t)$ is a discrete random variable for all $t$, then $\tilde{X}$ is a discrete-state random process. If the discrete random variable takes a finite number of values that is the same for all $t$, then $\tilde{X}$ is a finite-state random process.

• If $\tilde{X}(t)$ is a continuous random variable for all $t$, then $\tilde{X}$ is a continuous-state random process.

Note that there are continuous-state discrete-time random processes and discrete-state continuous-time random processes. Any combination is possible.

Figure 5.1: Realizations of the continuous-time (left) and discrete-time (right) random process defined in Example 5.1.2.

The underlying probability space $(\Omega, \mathcal{F}, \mathrm{P})$ mentioned in the definition completely determines the stochastic behavior of the random process. In principle we can specify random processes by defining (1) a probability space $(\Omega, \mathcal{F}, \mathrm{P})$ and (2) a mapping that assigns a function to each element of $\Omega$, as illustrated in the following example. This way of specifying random processes is only tractable for very simple cases.

Example 5.1.2 (Puddle). Bob asks Mary to model a puddle probabilistically. When the puddle is formed, it contains an amount of water that is distributed uniformly between 0 and 1 gallon. As time passes, the water evaporates. After a time interval $t$ the water that is left is $t$ times less than the initial quantity.

Mary models the water in the puddle as a continuous-state continuous-time random process $\tilde{C}$. The underlying sample space is $(0,1)$, the $\sigma$-algebra is the corresponding Borel $\sigma$-algebra (all possible countable unions of intervals in $(0,1)$) and the probability measure is the uniform probability measure on $(0,1)$. For a particular element in the sample space $\omega \in (0,1)$

$$\tilde{C}(\omega, t) := \frac{\omega}{t}, \quad t \in [1, \infty), \tag{5.1}$$

where the unit of $t$ is days in this example. Figure 5.1 shows different realizations of the random process. Each realization is a deterministic function on $[1, \infty)$.

Bob points out that he only cares what the state of the puddle is each day, as opposed to at any time $t$. Mary decides to simplify the model by using a continuous-state discrete-time random process $\tilde{D}$. The underlying probability space is exactly the same as before, but the time index is now discrete. For a particular element in the sample space $\omega \in (0,1)$

$$\tilde{D}(\omega, i) := \frac{\omega}{i}, \quad i = 1, 2, \ldots \tag{5.2}$$

Figure 5.1 shows different realizations of the discrete-time random process. Note that each realization is just a deterministic discrete sequence.

△

Recall that the value of the random process at a specific time is a random variable. We can 
therefore characterize the behavior of the process at that time by computing the distribution 
of the corresponding random variable. Similarly, we can consider the joint distribution of the 
process sampled at n fixed times. This is given by the nth-order distribution of the random 
process. 

Definition 5.1.3 (nth-order distribution). The nth-order distribution of a random process $\tilde{X}$ is the joint distribution of the random variables $\tilde{X}(t_1), \tilde{X}(t_2), \ldots, \tilde{X}(t_n)$ for any $n$ samples $\{t_1, t_2, \ldots, t_n\}$ of the time index $t$.

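Example 5.1.4 (Puddle (continued)). The first-order cdf of $\tilde{C}(t)$ in Example 5.1.2 is

\begin{align}
F_{\tilde{C}(t)}(x) &:= \mathrm{P}\left( \tilde{C}(t) \leq x \right) \tag{5.3} \\
&= \mathrm{P}(\omega \leq t x) \tag{5.4} \\
&= \begin{cases} \int_{u=0}^{tx} \mathrm{d}u = tx & \text{if } 0 \leq x \leq \frac{1}{t}, \\ 1 & \text{if } x > \frac{1}{t}, \\ 0 & \text{if } x < 0. \end{cases} \tag{5.5}
\end{align}

We obtain the first-order pdf by differentiating.

$$f_{\tilde{C}(t)}(x) = \begin{cases} t & \text{if } 0 \leq x \leq \frac{1}{t}, \\ 0 & \text{otherwise.} \end{cases} \tag{5.6}$$

△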
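The first-order cdf in (5.5) can be checked by simulation. The sketch below is an illustration rather than part of the notes: it draws outcomes uniformly from $(0,1)$, evaluates the realization of the continuous-time puddle process at one fixed time, and compares the empirical cdf with $tx$. The specific values $t = 4$ and $x = 0.1$ and the function name are arbitrary choices.

```python
import random


def puddle_continuous(omega, t):
    """Value at time t >= 1 of the realization C(omega, t) = omega / t."""
    return omega / t


rng = random.Random(0)
t, x = 4.0, 0.1  # evaluate the cdf at x = 0.1 for time t = 4 days
num_samples = 100_000
# Draw outcomes omega uniformly from (0, 1) and count how often C(t) <= x.
hits = sum(puddle_continuous(rng.random(), t) <= x for _ in range(num_samples))
empirical_cdf = hits / num_samples
print(empirical_cdf, t * x)  # the two values should be close (t * x = 0.4)
```

Since $0 \leq x \leq 1/t$ for these values, the relevant branch of (5.5) is $tx$.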


If the nth order distribution of a random process is shift-invariant, then the process is said to 
be strictly or strongly stationary. 

Definition 5.1.5 (Strictly/strongly stationary process). A process is stationary in a strict or strong sense if for any $n > 0$, any $n$ samples $t_1, t_2, \ldots, t_n$ and any displacement $\tau$, the random variables $\tilde{X}(t_1), \tilde{X}(t_2), \ldots, \tilde{X}(t_n)$ have the same joint distribution as $\tilde{X}(t_1 + \tau), \tilde{X}(t_2 + \tau), \ldots, \tilde{X}(t_n + \tau)$.

The random processes in Example 5.1.2 are clearly not strictly stationary because their first-order pdfs are not the same at every time point. An important example of strictly stationary processes is given by independent identically-distributed sequences, presented in Section 5.3.

As in the case of random variables and random vectors, defining the underlying probability space in order to specify a random process is usually not very practical, except for very simple cases like the one in Example 5.1.2. The reason is that it is challenging to come up with a probability space that gives rise to a given nth-order distribution of interest. Fortunately, we can also specify a random process by directly specifying its nth-order distribution for all values of $n = 1, 2, \ldots$ This completely characterizes the random process. Most of the random processes described in this chapter, e.g. independent identically-distributed sequences, Markov chains, Poisson processes and Gaussian processes, are specified in this way.

Finally, random processes can also be specified by expressing them as functions of other random processes. A function $\tilde{Y} := g(\tilde{X})$ of a random process $\tilde{X}$ is also a random process, as it maps any element $\omega$ in the sample space $\Omega$ to a function $\tilde{Y}(\omega, \cdot) := g(\tilde{X}(\omega, \cdot))$. In Section 5.6 we define random walks in this way.

5.2 Mean and autocovariance functions 

As in the case of random variables and random vectors, the expectation operator allows us to derive quantities that summarize the behavior of the random process. The mean of the random process is the mean of $\tilde{X}(t)$ at any fixed time $t$.

Definition 5.2.1 (Mean). The mean of a random process is the function

$$\mu_{\tilde{X}}(t) := \mathrm{E}\left( \tilde{X}(t) \right). \tag{5.7}$$

Note that the mean is a deterministic function of $t$. The autocovariance of a random process is another deterministic function that is equal to the covariance of $\tilde{X}(t_1)$ and $\tilde{X}(t_2)$ for any two points $t_1$ and $t_2$. If we set $t_1 := t_2$, then the autocovariance equals the variance at $t_1$.

Definition 5.2.2 (Autocovariance). The autocovariance of a random process is the function

$$R_{\tilde{X}}(t_1, t_2) := \operatorname{Cov}\left( \tilde{X}(t_1), \tilde{X}(t_2) \right). \tag{5.8}$$

In particular,

$$R_{\tilde{X}}(t, t) := \operatorname{Var}\left( \tilde{X}(t) \right). \tag{5.9}$$

Intuitively, the autocovariance quantifies the correlation between the process at two different time points. If this correlation only depends on the separation between the two points, then the process is said to be wide-sense stationary.

Definition 5.2.3 (Wide-sense/weakly stationary process). A process is stationary in a wide or weak sense if its mean is constant

$$\mu_{\tilde{X}}(t) := \mu \tag{5.10}$$

and its autocovariance function is shift invariant, i.e.

$$R_{\tilde{X}}(t_1, t_2) := R_{\tilde{X}}(t_1 + \tau, t_2 + \tau) \tag{5.11}$$

for any $t_1$ and $t_2$ and any shift $\tau$. For weakly stationary processes, the autocovariance is usually expressed as a function of the difference between the two time points,

$$R_{\tilde{X}}(s) := R_{\tilde{X}}(t, t + s) \quad \text{for any } t. \tag{5.12}$$


Figure 5.2: Realizations (bottom three rows) of Gaussian processes with zero mean and the autocovariance functions shown on the top row.

Note that any strictly stationary process is necessarily weakly stationary because its first and second-order distributions are shift invariant.

Figure 5.2 shows several stationary random processes with different autocovariance functions. If the autocovariance function is zero everywhere except at the origin, then the values of the random processes at different points are uncorrelated. This results in erratic fluctuations. When the autocovariance at neighboring times is high, the trajectory of the random process becomes smoother. The autocovariance can also induce more structured behavior, as in the right column of the figure. In that example $\tilde{X}(i)$ is negatively correlated with its two neighbors $\tilde{X}(i-1)$ and $\tilde{X}(i+1)$, but positively correlated with $\tilde{X}(i-2)$ and $\tilde{X}(i+2)$. This results in rapid periodic fluctuations.


5.3 Independent identically-distributed sequences 

An independent identically-distributed (iid) sequence $\tilde{X}$ is a discrete-time random process where $\tilde{X}(i)$ has the same distribution for any fixed $i$ and $\tilde{X}(i_1), \tilde{X}(i_2), \ldots, \tilde{X}(i_n)$ are mutually independent for any $n$ fixed indices and any $n \geq 2$. If $\tilde{X}(i)$ is a discrete random variable (or equivalently the state space of the random process is discrete), then we denote the pmf associated to the distribution of each entry by $p_{\tilde{X}}$. This pmf completely characterizes the random process, since for any $n$ indices $i_1, i_2, \ldots, i_n$ and any $n$:

$$p_{\tilde{X}(i_1), \tilde{X}(i_2), \ldots, \tilde{X}(i_n)}\left( x_{i_1}, x_{i_2}, \ldots, x_{i_n} \right) = \prod_{j=1}^{n} p_{\tilde{X}}\left( x_{i_j} \right). \tag{5.13}$$

Note that the distribution does not vary if we shift every index by the same amount, so the process is strictly stationary.

Similarly, if $\tilde{X}(i)$ is a continuous random variable, then we denote the pdf associated to the distribution by $f_{\tilde{X}}$. For any $n$ indices $i_1, i_2, \ldots, i_n$ and any $n$ we have

$$f_{\tilde{X}(i_1), \tilde{X}(i_2), \ldots, \tilde{X}(i_n)}\left( x_{i_1}, x_{i_2}, \ldots, x_{i_n} \right) = \prod_{j=1}^{n} f_{\tilde{X}}\left( x_{i_j} \right). \tag{5.14}$$

Figure 5.3 shows several realizations from iid sequences which follow a uniform and a geometric distribution.

The mean of an iid random sequence is constant and equal to the mean of its associated distribution, which we denote by $\mu$,

\begin{align}
\mu_{\tilde{X}}(i) &:= \mathrm{E}\left( \tilde{X}(i) \right) \tag{5.15} \\
&= \mu. \tag{5.16}
\end{align}

Let us denote the variance of the distribution associated to the iid sequence by $\sigma^2$. The autocovariance function is given by

\begin{align}
R_{\tilde{X}}(i, j) &:= \mathrm{E}\left( \tilde{X}(i) \tilde{X}(j) \right) - \mathrm{E}\left( \tilde{X}(i) \right) \mathrm{E}\left( \tilde{X}(j) \right) \tag{5.17} \\
&= \begin{cases} \sigma^2 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases} \tag{5.18}
\end{align}

This is not surprising: $\tilde{X}(i)$ and $\tilde{X}(j)$ are independent for all $i \neq j$, so they are also uncorrelated.

Figure 5.3: Realizations of an iid uniform sequence in (0,1) (first row) and an iid geometric sequence with parameter p = 0.4 (second row).
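The autocovariance in (5.18) can be estimated from many independent realizations of an iid sequence. The following sketch is illustrative only: it uses an iid uniform sequence in $(0,1)$, whose variance is $1/12$, and a hypothetical helper `sample_autocovariance` that averages over realizations.

```python
import random


def sample_autocovariance(realizations, i, j):
    """Estimate R(i, j) by averaging over independent realizations."""
    n = len(realizations)
    mean_i = sum(r[i] for r in realizations) / n
    mean_j = sum(r[j] for r in realizations) / n
    return sum((r[i] - mean_i) * (r[j] - mean_j) for r in realizations) / n


rng = random.Random(0)
# 20,000 realizations of an iid sequence of length 5, uniform in (0, 1).
realizations = [[rng.random() for _ in range(5)] for _ in range(20_000)]
same_index = sample_autocovariance(realizations, 2, 2)
different_indices = sample_autocovariance(realizations, 1, 3)
print(same_index, 1 / 12)   # estimate of the variance versus sigma^2 = 1/12
print(different_indices)    # should be close to zero
```

Averaging across realizations, rather than along a single sequence, mirrors the definition of the autocovariance as an expectation at two fixed indices.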


5.4 Gaussian process 


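A random process $\tilde{X}$ is Gaussian if any set of samples is a Gaussian random vector. A Gaussian process $\tilde{X}$ is fully characterized by its mean function $\mu_{\tilde{X}}$ and its autocovariance function $R_{\tilde{X}}$. For all $t_1, t_2, \ldots, t_n$ and any $n \geq 1$, the random vector

$$\tilde{\mathbf{x}} := \begin{bmatrix} \tilde{X}(t_1) \\ \tilde{X}(t_2) \\ \vdots \\ \tilde{X}(t_n) \end{bmatrix} \tag{5.19}$$

is a Gaussian random vector with mean

$$\mu_{\tilde{\mathbf{x}}} := \begin{bmatrix} \mu_{\tilde{X}}(t_1) \\ \mu_{\tilde{X}}(t_2) \\ \vdots \\ \mu_{\tilde{X}}(t_n) \end{bmatrix} \tag{5.20}$$

and covariance matrix

$$\Sigma_{\tilde{\mathbf{x}}} := \begin{bmatrix} R_{\tilde{X}}(t_1, t_1) & R_{\tilde{X}}(t_1, t_2) & \cdots & R_{\tilde{X}}(t_1, t_n) \\ R_{\tilde{X}}(t_1, t_2) & R_{\tilde{X}}(t_2, t_2) & \cdots & R_{\tilde{X}}(t_2, t_n) \\ \vdots & \vdots & \ddots & \vdots \\ R_{\tilde{X}}(t_1, t_n) & R_{\tilde{X}}(t_2, t_n) & \cdots & R_{\tilde{X}}(t_n, t_n) \end{bmatrix}. \tag{5.21}$$

Figure 5.2 shows realizations of several discrete Gaussian processes with different autocovariance functions. Sampling from a Gaussian random process boils down to sampling a Gaussian random vector with the appropriate mean and covariance matrix.

Algorithm 5.4.1 (Generating a Gaussian random process). To sample from a Gaussian random process with mean function $\mu_{\tilde{X}}$ and autocovariance function $R_{\tilde{X}}$ at $n$ points $t_1, \ldots, t_n$ we:

1. Compute the mean vector $\mu_{\tilde{\mathbf{x}}}$ given by (5.20) and the covariance matrix $\Sigma_{\tilde{\mathbf{x}}}$ given by (5.21).

2. Generate $n$ independent samples from a standard Gaussian.

3. Color the samples according to $\Sigma_{\tilde{\mathbf{x}}}$ and center them around $\mu_{\tilde{\mathbf{x}}}$ as described in Algorithm 4.3.15.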
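The three steps of Algorithm 5.4.1 can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the notes: the coloring step uses a Cholesky factor as the square root of the covariance matrix (Algorithm 4.3.15 may rely on a different factorization), and the zero mean and exponentially decaying autocovariance are hypothetical choices made only to have something to sample.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gaussian_process(mean_fn, autocov_fn, times, rng):
    """Sample the process at the given times following Algorithm 5.4.1."""
    n = len(times)
    # Step 1: mean vector and covariance matrix at the requested times.
    mean = [mean_fn(t) for t in times]
    cov = [[autocov_fn(s, t) for t in times] for s in times]
    # Step 2: n independent standard Gaussian samples.
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Step 3: color with a square root of the covariance, then center.
    lower = cholesky(cov)
    return [mean[i] + sum(lower[i][k] * white[k] for k in range(i + 1))
            for i in range(n)]


# Hypothetical example: zero mean, exponentially decaying autocovariance.
rng = random.Random(0)
times = list(range(10))
realization = sample_gaussian_process(
    lambda t: 0.0, lambda s, t: math.exp(-abs(s - t)), times, rng)
print(realization)
```

If $L L^T = \Sigma$ and the entries of the white vector are independent standard Gaussians, then the colored vector has covariance matrix $\Sigma$, which is why any valid square root of $\Sigma$ works in step 3.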


5.5 Poisson process 

In Example 2.2.8 we motivate the definition of Poisson random variable by deriving the distribution of the number of events that occur in a fixed time interval under the following conditions:

1. Each event occurs independently from every other event.

2. Events occur uniformly.

3. Events occur at a rate of $\lambda$ events per time interval.

We now assume that these conditions hold in the semi-infinite interval $[0, \infty)$ and define a random process $\tilde{N}$ that counts the events. To be clear, $\tilde{N}(t)$ is the number of events that happen between 0 and $t$.

By the same reasoning as in Example 2.2.8, the random variable $\tilde{N}(t_2) - \tilde{N}(t_1)$, which represents the number of events that occur between $t_1$ and $t_2$, is a Poisson random variable with parameter $\lambda (t_2 - t_1)$. This holds for any $t_1$ and $t_2$. In addition, by Condition 1 the random variables $\tilde{N}(t_2) - \tilde{N}(t_1)$ and $\tilde{N}(t_4) - \tilde{N}(t_3)$ are independent as long as the intervals $[t_1, t_2]$ and $(t_3, t_4)$ do not overlap. A Poisson process is a discrete-state continuous-time random process that satisfies these two properties.

Poisson processes are often used to model events such as earthquakes, telephone calls, decay of radioactive particles, neural spikes, etc. Figure 2.6 shows an example of a real scenario where the number of calls received at a call center is well approximated as a Poisson process (as long as we only consider a few hours). Note that here we are using the word event to mean something that happens, such as the arrival of an email, instead of a set within a sample space, which is the meaning that it usually has elsewhere in these notes.

Figure 5.4: Events corresponding to the realizations of a Poisson process $\tilde{N}$ for different values of the parameter $\lambda$. $\tilde{N}(t)$ equals the number of events up to time $t$.


Definition 5.5.1 (Poisson process). A Poisson process with parameter $\lambda$ is a discrete-state continuous-time random process $\tilde{N}$ such that

1. $\tilde{N}(0) = 0$.

2. For any $t_1 < t_2 < t_3 < t_4$, $\tilde{N}(t_2) - \tilde{N}(t_1)$ is a Poisson random variable with parameter $\lambda (t_2 - t_1)$.

3. For any $t_1 < t_2 < t_3 < t_4$ the random variables $\tilde{N}(t_2) - \tilde{N}(t_1)$ and $\tilde{N}(t_4) - \tilde{N}(t_3)$ are independent.

We now check that the random process is well defined, by proving that we can derive the joint pmf of $\tilde{N}$ at any $n$ points $t_1 < t_2 < \ldots < t_n$ for any $n > 0$. To alleviate notation let $p(\tilde{\lambda}, x)$ be the value of the pmf of a Poisson random variable with parameter $\tilde{\lambda}$ at $x$, i.e.

$$p(\tilde{\lambda}, x) := \frac{\tilde{\lambda}^x e^{-\tilde{\lambda}}}{x!}. \tag{5.22}$$

We have

\begin{align}
& p_{\tilde{N}(t_1), \ldots, \tilde{N}(t_n)}(x_1, \ldots, x_n) \tag{5.23} \\
&= \mathrm{P}\left( \tilde{N}(t_1) = x_1, \ldots, \tilde{N}(t_n) = x_n \right) \tag{5.24} \\
&= \mathrm{P}\left( \tilde{N}(t_1) = x_1, \tilde{N}(t_2) - \tilde{N}(t_1) = x_2 - x_1, \ldots, \tilde{N}(t_n) - \tilde{N}(t_{n-1}) = x_n - x_{n-1} \right) \tag{5.25} \\
&= \mathrm{P}\left( \tilde{N}(t_1) = x_1 \right) \mathrm{P}\left( \tilde{N}(t_2) - \tilde{N}(t_1) = x_2 - x_1 \right) \cdots \mathrm{P}\left( \tilde{N}(t_n) - \tilde{N}(t_{n-1}) = x_n - x_{n-1} \right) \notag \\
&= p(\lambda t_1, x_1) \, p(\lambda (t_2 - t_1), x_2 - x_1) \cdots p(\lambda (t_n - t_{n-1}), x_n - x_{n-1}). \tag{5.26}
\end{align}

In words, we have expressed the event that $\tilde{N}(t_i) = x_i$ for $1 \leq i \leq n$ in terms of the random variables $\tilde{N}(t_1)$ and $\tilde{N}(t_i) - \tilde{N}(t_{i-1})$, $2 \leq i \leq n$, which are independent Poisson random variables with parameters $\lambda t_1$ and $\lambda (t_i - t_{i-1})$ respectively.

Figure 5.4 shows several sequences of events corresponding to the realizations of a Poisson process $\tilde{N}$ for different values of the parameter $\lambda$ ($\tilde{N}(t)$ equals the number of events up to time $t$). Interestingly, the interarrival time of the events, i.e. the time between contiguous events, always has the same distribution: it is an exponential random variable.

Lemma 5.5.2 (Interarrival times of a Poisson process are exponential). Let $T$ denote the time between two contiguous events in a Poisson process with parameter $\lambda$. $T$ is an exponential random variable with parameter $\lambda$.

The proof is in Section 5.7.1 of the appendix. Figure 2.11 shows that the interarrival times of 
telephone calls at a call center are indeed well modeled as exponential. 

Lemma 5.5.2 suggests that to simulate a Poisson process all we need to do is sample from an 
exponential distribution. 

Algorithm 5.5.3 (Generating a Poisson random process). To sample from a Poisson random process with parameter $\lambda$ we:

1. Generate independent samples $t_1, t_2, t_3, \ldots$ from an exponential random variable with parameter $\lambda$.

2. Set the events of the Poisson process to occur at $t_1$, $t_1 + t_2$, $t_1 + t_2 + t_3$, ...

Figure 5.4 was generated in this way. To confirm that the algorithm allows us to sample from a Poisson process, we would have to prove that the resulting process satisfies the conditions in Definition 5.5.1. This is indeed the case, but we omit the proof.
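Algorithm 5.5.3 translates almost directly into code. The sketch below is illustrative: the finite time horizon, the function names and the parameter values are assumptions made here so that the simulation terminates and can be checked against the fact that the number of events up to time $t$ has mean $\lambda t$.

```python
import random


def sample_poisson_process(rate, horizon, rng):
    """Event times in [0, horizon] of a Poisson process with the given rate."""
    event_times = []
    t = rng.expovariate(rate)  # first interarrival time
    while t <= horizon:
        event_times.append(t)
        t += rng.expovariate(rate)  # add the next exponential interarrival time
    return event_times


def count(event_times, t):
    """N(t): number of events that occurred up to time t."""
    return sum(1 for s in event_times if s <= t)


rng = random.Random(0)
rate, horizon = 2.0, 5.0
counts = [count(sample_poisson_process(rate, horizon, rng), horizon)
          for _ in range(20_000)]
mean_count = sum(counts) / len(counts)
print(mean_count, rate * horizon)  # sample mean of N(5) versus lambda * t = 10
```

Here `random.expovariate` takes the rate (the inverse of the mean) as its argument, which matches the parameter $\lambda$ of the exponential interarrival times.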

The following lemma, which derives the mean and autocovariance functions of a Poisson process, is proved in Section 5.7.2.

Lemma 5.5.4 (Mean and autocovariance of a Poisson process). The mean and autocovariance of a Poisson process $\tilde{X}$ with parameter $\lambda$ equal

\begin{align}
\mathrm{E}\left( \tilde{X}(t) \right) &= \lambda t, \tag{5.27} \\
R_{\tilde{X}}(t_1, t_2) &= \lambda \min\{t_1, t_2\}. \tag{5.28}
\end{align}

The mean of the Poisson process is not constant and its autocovariance is not shift-invariant, so the process is neither strictly nor wide-sense stationary.

Example 5.5.5 (Earthquakes). The number of earthquakes with intensity at least 3 on the 
Richter scale occurring in the San Francisco peninsula is modeled using a Poisson process with 
parameter 0.3 earthquakes/year. What is the probability that there are no earthquakes in the 
next ten years and then at least one earthquake over the following twenty years? 

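We define a Poisson process $\tilde{X}$ with parameter 0.3 to model the problem. The number of earthquakes in the next 10 years, i.e. $\tilde{X}(10)$, is a Poisson random variable with parameter $0.3 \cdot 10 = 3$. The earthquakes in the following 20 years, $\tilde{X}(30) - \tilde{X}(10)$, are Poisson with parameter $0.3 \cdot 20 = 6$. The two random variables are independent because the intervals do not overlap.

\begin{align}
\mathrm{P}\left( \tilde{X}(10) = 0, \tilde{X}(30) \geq 1 \right) &= \mathrm{P}\left( \tilde{X}(10) = 0, \tilde{X}(30) - \tilde{X}(10) \geq 1 \right) \tag{5.29} \\
&= \mathrm{P}\left( \tilde{X}(10) = 0 \right) \mathrm{P}\left( \tilde{X}(30) - \tilde{X}(10) \geq 1 \right) \tag{5.30} \\
&= \mathrm{P}\left( \tilde{X}(10) = 0 \right) \left( 1 - \mathrm{P}\left( \tilde{X}(30) - \tilde{X}(10) = 0 \right) \right) \tag{5.31} \\
&= e^{-3} \left( 1 - e^{-6} \right) = 4.97 \cdot 10^{-2}. \tag{5.32}
\end{align}

The probability is 4.97%.

△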
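The arithmetic in (5.29) to (5.32) is easy to reproduce. The snippet below is a small sketch that evaluates the Poisson pmf of (5.22) at zero for the two non-overlapping intervals; the variable names are illustrative.

```python
import math


def poisson_pmf(parameter, x):
    """Value at x of the pmf of a Poisson random variable, as in (5.22)."""
    return parameter ** x * math.exp(-parameter) / math.factorial(x)


rate = 0.3  # earthquakes per year
p_none_first_10 = poisson_pmf(rate * 10, 0)     # no earthquakes in 10 years
p_some_next_20 = 1 - poisson_pmf(rate * 20, 0)  # at least one in the next 20
probability = p_none_first_10 * p_some_next_20
print(f"{probability:.4f}")  # prints 0.0497
```

The product form relies on the independence of the counts in the two intervals, as stated in the example.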


5.6 Random walk 


A random walk is a discrete-time random process that models a sequence of steps in random directions. To specify a random walk formally, we first define an iid sequence of steps $\tilde{S}$ such that

$$\tilde{S}(i) = \begin{cases} +1 & \text{with probability } \frac{1}{2}, \\ -1 & \text{with probability } \frac{1}{2}. \end{cases} \tag{5.33}$$

We define a random walk $\tilde{X}$ as the discrete-state discrete-time random process

$$\tilde{X}(i) := \begin{cases} 0 & \text{for } i = 0, \\ \sum_{j=1}^{i} \tilde{S}(j) & \text{for } i = 1, 2, \ldots \end{cases} \tag{5.34}$$

We have specified $\tilde{X}$ as a function of an iid sequence, so it is well defined. Figure 5.5 shows several realizations of the random walk.

$\tilde{X}$ is symmetric (there is the same probability of taking a positive step and a negative step) and begins at the origin. It is easy to define variations where the walk is non-symmetric and begins at another point. Generalizations to higher dimensional spaces, for instance to model random processes on a 2D surface, are also possible.

Figure 5.5: Realizations of the random walk defined in Section 5.6.

We derive the first-order pmf of the random walk in the following lemma, proved in Section 5.7.3 of the appendix.

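Lemma 5.6.1 (First-order pmf of a random walk). The first-order pmf of the random walk $\tilde{X}$ is

$$p_{\tilde{X}(i)}(x) = \begin{cases} \binom{i}{\frac{i+x}{2}} \frac{1}{2^i} & \text{if } i + x \text{ is even and } -i \leq x \leq i, \\ 0 & \text{otherwise.} \end{cases} \tag{5.35}$$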


The first-order distribution of the random walk is clearly time-dependent, so the random process 
is not strictly stationary. By the following lemma, the mean of the random walk is constant (it 
equals zero). The autocovariance, however, is not shift invariant, so the process is not weakly 
stationary either. 

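Lemma 5.6.2 (Mean and autocovariance of a random walk). The mean and autocovariance of the random walk $\tilde{X}$ are

\begin{align}
\mu_{\tilde{X}}(i) &= 0, \tag{5.36} \\
R_{\tilde{X}}(i, j) &= \min\{i, j\}. \tag{5.37}
\end{align}

Proof.

\begin{align}
\mu_{\tilde{X}}(i) &:= \mathrm{E}\left( \tilde{X}(i) \right) \tag{5.38} \\
&= \mathrm{E}\left( \sum_{j=1}^{i} \tilde{S}(j) \right) \tag{5.39} \\
&= \sum_{j=1}^{i} \mathrm{E}\left( \tilde{S}(j) \right) \quad \text{by linearity of expectation} \tag{5.40} \\
&= 0. \tag{5.41}
\end{align}

\begin{align}
R_{\tilde{X}}(i, j) &:= \mathrm{E}\left( \tilde{X}(i) \tilde{X}(j) \right) - \mathrm{E}\left( \tilde{X}(i) \right) \mathrm{E}\left( \tilde{X}(j) \right) \tag{5.42} \\
&= \mathrm{E}\left( \sum_{k=1}^{i} \sum_{l=1}^{j} \tilde{S}(k) \tilde{S}(l) \right) \tag{5.43} \\
&= \mathrm{E}\left( \sum_{k=1}^{\min\{i,j\}} \tilde{S}(k)^2 + \sum_{k=1}^{i} \sum_{\substack{l=1 \\ l \neq k}}^{j} \tilde{S}(k) \tilde{S}(l) \right) \tag{5.44} \\
&= \sum_{k=1}^{\min\{i,j\}} 1 + \sum_{k=1}^{i} \sum_{\substack{l=1 \\ l \neq k}}^{j} \mathrm{E}\left( \tilde{S}(k) \right) \mathrm{E}\left( \tilde{S}(l) \right) \tag{5.45} \\
&= \min\{i, j\}, \tag{5.46}
\end{align}

where (5.45) follows from linearity of expectation and independence. □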


The variance of $\tilde{X}$ at $i$ equals $R_{\tilde{X}}(i, i) = i$, which means that the standard deviation of the random walk scales as $\sqrt{i}$.
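Lemma 5.6.2 can be checked empirically by simulating many realizations of the symmetric random walk. The sketch below is illustrative: the number of walks and the indices $i = 5$, $j = 20$ are arbitrary choices, and the estimates use the fact that the mean is zero so that the autocovariance reduces to $\mathrm{E}(\tilde{X}(i) \tilde{X}(j))$.

```python
import random


def sample_random_walk(num_steps, rng):
    """Realization X(0), X(1), ..., X(num_steps) of the symmetric walk."""
    walk = [0]
    for _ in range(num_steps):
        step = 1 if rng.random() < 0.5 else -1  # S(i) is +1 or -1 with probability 1/2
        walk.append(walk[-1] + step)
    return walk


rng = random.Random(0)
num_walks, i, j = 20_000, 5, 20
walks = [sample_random_walk(j, rng) for _ in range(num_walks)]
# The mean is zero, so the autocovariance R(i, j) is just E(X(i) X(j)).
autocov_estimate = sum(w[i] * w[j] for w in walks) / num_walks
variance_estimate = sum(w[j] ** 2 for w in walks) / num_walks
print(autocov_estimate, min(i, j))  # estimate versus min{i, j} = 5
print(variance_estimate, j)         # estimate versus R(j, j) = 20
```

The second estimate illustrates the growth of the variance with the index, which is why the walk is not weakly stationary even though its mean is constant.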

Example 5.6.3 (Gambler). A gambler is playing the following game. A fair coin is flipped sequentially. Every time the result is heads the gambler wins a dollar; every time it lands on tails she loses a dollar. We can model the amount of money earned (or lost) by the gambler as a random walk, as long as the flips are independent. This allows us to determine that the expected gain equals zero or that the probability that the gambler is up 6 dollars or more after the first 10 flips is

\begin{align}
\mathrm{P}(\text{gambler is up \$6 or more}) &= p_{\tilde{X}(10)}(6) + p_{\tilde{X}(10)}(8) + p_{\tilde{X}(10)}(10) \tag{5.47} \\
&= \binom{10}{8} \frac{1}{2^{10}} + \binom{10}{9} \frac{1}{2^{10}} + \frac{1}{2^{10}} \tag{5.48} \\
&= 5.47 \cdot 10^{-2}. \tag{5.49}
\end{align}

△

5.7 Proofs 

5.7.1 Proof of Lemma 5.5.2 

We begin by deriving the cdf of $T$,

\begin{align}
F_T(t) &:= \mathrm{P}(T \leq t) \tag{5.50} \\
&= 1 - \mathrm{P}(T > t) \tag{5.51} \\
&= 1 - \mathrm{P}(\text{no events in an interval of length } t) \tag{5.52} \\
&= 1 - e^{-\lambda t} \tag{5.53}
\end{align}

because the number of points in an interval of length $t$ follows a Poisson distribution with parameter $\lambda t$. Differentiating we conclude that

$$f_T(t) = \lambda e^{-\lambda t}. \tag{5.54}$$

5.7.2 Proof of Lemma 5.5.4

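By definition the number of events between 0 and $t$ is distributed as a Poisson random variable with parameter $\lambda t$ and hence its mean is equal to $\lambda t$.

The autocovariance equals

\begin{align}
R_{\tilde{X}}(t_1, t_2) &:= \mathrm{E}\left( \tilde{X}(t_1) \tilde{X}(t_2) \right) - \mathrm{E}\left( \tilde{X}(t_1) \right) \mathrm{E}\left( \tilde{X}(t_2) \right) \tag{5.55} \\
&= \mathrm{E}\left( \tilde{X}(t_1) \tilde{X}(t_2) \right) - \lambda^2 t_1 t_2. \tag{5.56}
\end{align}

Assume without loss of generality that $t_1 \leq t_2$. By assumption $\tilde{X}(t_1)$ and $\tilde{X}(t_2) - \tilde{X}(t_1)$ are independent so that

\begin{align}
\mathrm{E}\left( \tilde{X}(t_1) \tilde{X}(t_2) \right) &= \mathrm{E}\left( \tilde{X}(t_1) \left( \tilde{X}(t_2) - \tilde{X}(t_1) \right) + \tilde{X}(t_1)^2 \right) \tag{5.57} \\
&= \mathrm{E}\left( \tilde{X}(t_1) \right) \mathrm{E}\left( \tilde{X}(t_2) - \tilde{X}(t_1) \right) + \mathrm{E}\left( \tilde{X}(t_1)^2 \right) \tag{5.58} \\
&= \lambda^2 t_1 (t_2 - t_1) + \lambda t_1 + \lambda^2 t_1^2 \tag{5.59} \\
&= \lambda^2 t_1 t_2 + \lambda t_1. \tag{5.60}
\end{align}

Substituting (5.60) into (5.56) yields $R_{\tilde{X}}(t_1, t_2) = \lambda t_1 = \lambda \min\{t_1, t_2\}$.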


5.7.3 Proof of Lemma 5.6.1 

Let us define the number of positive steps $S_+$ that the random walk takes. Given the assumptions on $\tilde{S}$, this is a binomial random variable with parameters $i$ and $1/2$. The number of negative steps is $S_- := i - S_+$. In order for $\tilde{X}(i)$ to equal $x$ we need for the net number of steps to equal $x$, which implies

\begin{align}
x &= S_+ - S_- \tag{5.61} \\
&= 2 S_+ - i. \tag{5.62}
\end{align}

This means that $S_+$ must equal $\frac{i+x}{2}$. We conclude that

\begin{align}
p_{\tilde{X}(i)}(x) &= \mathrm{P}\left( \sum_{j=1}^{i} \tilde{S}(j) = x \right) \tag{5.63} \\
&= \binom{i}{\frac{i+x}{2}} \frac{1}{2^i} \quad \text{if } \frac{i+x}{2} \text{ is an integer between 0 and } i. \tag{5.64}
\end{align}


Chapter 6 


Convergence of Random Processes 


In this chapter we study the convergence of discrete random processes. This allows us to characterize two phenomena that are fundamental in statistical estimation and probabilistic modeling: the law of large numbers and the central limit theorem.

6.1 Types of convergence 

Let us quickly recall the concept of convergence for a deterministic sequence of real numbers $x_1, x_2, \ldots$ We have

$$\lim_{i \to \infty} x_i = x \tag{6.1}$$

if $x_i$ is arbitrarily close to $x$ as the index $i$ grows. More formally, the sequence converges to $x$ if for any $\epsilon > 0$ there is an index $i_0$ such that for all indices $i$ greater than $i_0$ we have $|x_i - x| < \epsilon$.

Recall that any realization of a discrete-time random process $\tilde{X}(\omega, i)$ where we fix the outcome $\omega$ is a deterministic sequence. Establishing convergence of such realizations to a fixed number can therefore be achieved by computing the corresponding limit. However, if we consider the random process itself instead of a realization and we want to determine whether it eventually converges to a random variable $X$, then deterministic convergence no longer makes sense. In this section we describe several alternative definitions of convergence, which allow us to extend this concept to random quantities.

6.1.1 Convergence with probability one 

Consider a discrete random process $\widetilde{X}$ and a random variable $X$ defined on the same probability
space. If we fix an element $\omega$ of the sample space $\Omega$, then $\widetilde{X}(\omega, i)$ is a deterministic sequence
and $X(\omega)$ is a constant. It is consequently possible to verify whether $\widetilde{X}(\omega, i)$ converges
deterministically to $X(\omega)$ as $i \to \infty$ for that particular value of $\omega$. In fact, we can ask: what is the
probability that this happens? To be precise, this would be the probability that if we draw $\omega$
we have

    \lim_{i \to \infty} \widetilde{X}(\omega, i) = X(\omega).    (6.2)

If this probability equals one then we say that $\widetilde{X}(i)$ converges to $X$ with probability one.

Figure 6.1: Convergence to zero of the discrete random process $\widetilde{D}$ defined in Example 5.1.2.


Definition 6.1.1 (Convergence with probability one). A discrete random process $\widetilde{X}$ converges
with probability one to a random variable $X$ belonging to the same probability space $(\Omega, \mathcal{F}, P)$ if

    P\left( \left\{ \omega \mid \omega \in \Omega, \ \lim_{i \to \infty} \widetilde{X}(\omega, i) = X(\omega) \right\} \right) = 1.    (6.3)

Recall that in general the sample space $\Omega$ is very difficult to define and manipulate explicitly,
except for very simple cases.

Example 6.1.2 (Puddle (continued from Example 5.1.2)). Let us consider the discrete random
process $\widetilde{D}$ defined in Example 5.1.2. If we fix $\omega \in (0, 1)$

    \lim_{i \to \infty} \widetilde{D}(\omega, i) = \lim_{i \to \infty} \frac{\omega}{i}    (6.4)

    = 0.    (6.5)

It turns out the realizations tend to zero for all possible values of $\omega$ in the sample space. This
implies that $\widetilde{D}$ converges to zero with probability one.

△



6.1.2 Convergence in mean square and in probability 

To verify convergence with probability one we fix the outcome $\omega$ and check whether the
corresponding realizations of the random process converge deterministically. An alternative viewpoint
is to fix the indexing variable $i$ and consider how close the random variable $\widetilde{X}(i)$ is to another
random variable $X$ as we increase $i$.

A possible measure of the distance between two random variables is the mean square of their
difference. If $E((X - Y)^2) = 0$ then $X = Y$ with probability one by Chebyshev's inequality.
The mean square deviation between $\widetilde{X}(i)$ and $X$ is a deterministic quantity (a number), so we
can evaluate its convergence as $i \to \infty$. If it converges to zero then we say that the random
sequence converges in mean square.





Definition 6.1.3 (Convergence in mean square). A discrete random process $\widetilde{X}$ converges in
mean square to a random variable $X$ belonging to the same probability space if

    \lim_{i \to \infty} E\left( \left( X - \widetilde{X}(i) \right)^2 \right) = 0.    (6.6)


Alternatively, we can consider the probability that $\widetilde{X}(i)$ is separated from $X$ by more than a
certain fixed $\epsilon > 0$. If for any $\epsilon$, no matter how small, this probability converges to zero as
$i \to \infty$ then we say that the random sequence converges in probability.

Definition 6.1.4 (Convergence in probability). A discrete random process $\widetilde{X}$ converges in
probability to another random variable $X$ belonging to the same probability space if for any $\epsilon > 0$

    \lim_{i \to \infty} P\left( \left| X - \widetilde{X}(i) \right| > \epsilon \right) = 0.    (6.7)


Note that as in the case of convergence in mean square, the limit in this definition is deterministic, 
as it is a limit of probabilities, which are just real numbers. 

As a direct consequence of Markov’s inequality, convergence in mean square implies convergence 
in probability. 


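Theorem 6.1.5. Convergence in mean square implies convergence in probability.

Proof. We have

    \lim_{i \to \infty} P\left( \left| X - \widetilde{X}(i) \right| > \epsilon \right) = \lim_{i \to \infty} P\left( \left( X - \widetilde{X}(i) \right)^2 > \epsilon^2 \right)    (6.8)

    \leq \lim_{i \to \infty} \frac{E\left( \left( X - \widetilde{X}(i) \right)^2 \right)}{\epsilon^2}    by Markov's inequality    (6.9)

    = 0,    (6.10)

if the sequence converges in mean square.

□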


It turns out that convergence with probability one also implies convergence in probability.
However, convergence with probability one does not imply convergence in mean square, or vice
versa. The difference between these three types of convergence is not very important for the
purposes of this course.


6.1.3 Convergence in distribution 

In some cases, a random process $\widetilde{X}$ does not converge to the value of any random variable, but
the cdf of $\widetilde{X}(i)$ converges pointwise to the cdf of another random variable $X$. In that case,
the actual values of $\widetilde{X}(i)$ and $X$ are not necessarily close, but in the limit they have the same
distribution. In this case, we say that $\widetilde{X}$ converges in distribution to $X$.

Definition 6.1.6 (Convergence in distribution). A random process $\widetilde{X}$ converges in distribution
to a random variable $X$ belonging to the same probability space if

    \lim_{i \to \infty} F_{\widetilde{X}(i)}(x) = F_X(x)    (6.11)

for all $x \in \mathbb{R}$ where $F_X$ is continuous.





Note that convergence in distribution is a much weaker notion than convergence with probability
one, in mean square or in probability. If a discrete random process $\widetilde{X}$ converges to a random
variable $X$ in distribution, this only means that as $i$ becomes large the distribution of $\widetilde{X}(i)$ tends
to the distribution of $X$, not that the values of the two random variables are close. However,
convergence in probability (and hence convergence with probability one or in mean square) does
imply convergence in distribution.

Example 6.1.7 (Binomial converges to Poisson). Let us define a discrete random process $\widetilde{X}(i)$
such that the distribution of $\widetilde{X}(i)$ is binomial with parameters $i$ and $p := \lambda / i$. $\widetilde{X}(i)$ and $\widetilde{X}(j)$
are independent for $i \neq j$, which completely characterizes the n-order distributions of the process
for all $n \geq 1$. Consider a Poisson random variable $X$ with parameter $\lambda$ that is independent of
$\widetilde{X}(i)$ for all $i$. Do you expect the values of $X$ and $\widetilde{X}(i)$ to be close as $i \to \infty$?

No! In fact even $\widetilde{X}(i)$ and $\widetilde{X}(i + 1)$ will not be close in general. However, $\widetilde{X}$ converges in
distribution to $X$, as established in Example 2.2.8:

    \lim_{i \to \infty} p_{\widetilde{X}(i)}(x) = \lim_{i \to \infty} \binom{i}{x} p^x (1 - p)^{(i - x)}    (6.12)

    = \frac{\lambda^x e^{-\lambda}}{x!}    (6.13)

    = p_X(x).    (6.14)

△
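The limit in Example 6.1.7 can be checked numerically. The following is a minimal sketch in Python, not part of the original notes: the function names, the value $\lambda = 2$ and the truncation of the comparison at x = 20 are illustrative assumptions.

```python
import math

def binomial_pmf(x, i, p):
    """pmf of a binomial with parameters i and p, evaluated at x."""
    return math.comb(i, x) * p**x * (1 - p) ** (i - x)

def poisson_pmf(x, lam):
    """pmf of a Poisson with parameter lam, evaluated at x."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def max_pmf_gap(i, lam, x_max=20):
    """Largest gap between the two pmfs over x = 0, ..., min(i, x_max)."""
    p = lam / i  # binomial parameter used in Example 6.1.7
    return max(
        abs(binomial_pmf(x, i, p) - poisson_pmf(x, lam))
        for x in range(min(i, x_max) + 1)
    )

lam = 2.0  # illustrative choice of the Poisson parameter
for i in (10, 100, 1000, 10000):
    print(i, max_pmf_gap(i, lam))
```

The printed gap shrinks as i grows, which is what convergence of the pmfs predicts. The code never compares values of the random variables themselves, only their distributions.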


6.2 Law of large numbers 

Let us define the average of a discrete random process. 

Definition 6.2.1 (Moving average). The moving or running average $\widetilde{A}$ of a discrete random
process $\widetilde{X}$, defined for i = 1, 2, ... (i.e. 1 is the starting point), is equal to

    \widetilde{A}(i) := \frac{1}{i} \sum_{j=1}^{i} \widetilde{X}(j).    (6.15)

Consider an iid sequence. A very natural interpretation for the moving average is that it is a 
real-time estimate of the mean. In fact, in statistical terms the moving average is the sample 
mean of the process up to time i (the sample mean is defined in Chapter 8). The law of large 
numbers establishes that the average does indeed converge to the mean of the iid sequence. 

Theorem 6.2.2 (Weak law of large numbers). Let $\widetilde{X}$ be an iid discrete random process with
mean $\mu_{\widetilde{X}} := \mu$ such that the variance of $\widetilde{X}(i)$, $\sigma^2$, is bounded. Then the average $\widetilde{A}$ of $\widetilde{X}$
converges in mean square to $\mu$.
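
Proof. First, we establish that the mean of $\widetilde{A}(i)$ is constant and equal to $\mu$,

    E\left( \widetilde{A}(i) \right) = E\left( \frac{1}{i} \sum_{j=1}^{i} \widetilde{X}(j) \right)    (6.16)

    = \frac{1}{i} \sum_{j=1}^{i} E\left( \widetilde{X}(j) \right)    (6.17)

    = \mu.    (6.18)

Due to the independence assumption, the variance of the sum scales linearly in $i$, so the variance
of the average decays as $1/i$. Recall that for independent random variables the variance of the
sum equals the sum of the variances,

    Var\left( \widetilde{A}(i) \right) = Var\left( \frac{1}{i} \sum_{j=1}^{i} \widetilde{X}(j) \right)    (6.19)

    = \frac{1}{i^2} \sum_{j=1}^{i} Var\left( \widetilde{X}(j) \right)    (6.20)

    = \frac{\sigma^2}{i}.    (6.21)

We conclude that

    \lim_{i \to \infty} E\left( \left( \widetilde{A}(i) - \mu \right)^2 \right) = \lim_{i \to \infty} E\left( \left( \widetilde{A}(i) - E\left( \widetilde{A}(i) \right) \right)^2 \right)    by (6.18)    (6.22)

    = \lim_{i \to \infty} Var\left( \widetilde{A}(i) \right)    (6.23)

    = \lim_{i \to \infty} \frac{\sigma^2}{i}    by (6.21)    (6.24)

    = 0.    (6.25)

□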
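The key step of the proof, that the variance of the average equals $\sigma^2 / i$, can be checked by simulation. The sketch below is an illustration rather than part of the notes: it assumes an iid sequence that is uniform on [0, 1] (so $\sigma^2 = 1/12$), and the function names, seed and number of repetitions are arbitrary choices.

```python
import random
import statistics

def average_of_draws(i, sampler):
    """Average of i independent draws produced by sampler()."""
    return sum(sampler() for _ in range(i)) / i

def empirical_variance_of_average(i, sampler, trials=2000):
    """Sample variance of the average, estimated from independent repetitions."""
    averages = [average_of_draws(i, sampler) for _ in range(trials)]
    return statistics.variance(averages)

rng = random.Random(0)  # fixed seed, only for reproducibility
sigma2 = 1 / 12         # variance of a uniform random variable on [0, 1]

for i in (10, 100, 400):
    estimate = empirical_variance_of_average(i, rng.random)
    # Second column: empirical variance; third column: sigma^2 / i from (6.21).
    print(i, estimate, sigma2 / i)
```

The two printed columns should agree up to the sampling error of the variance estimate, which is a few percent for 2000 repetitions.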


By Theorem 6.1.5 the average also converges to the mean of the iid sequence in probability. In 
fact, one can also prove convergence with probability one under the same assumptions. This 
result is known as the strong law of large numbers, but the proof is beyond the scope of these 
notes. We refer the interested reader to more advanced texts in probability theory. 

Figure 6.2 shows averages of realizations of several iid sequences. When the iid sequence is
Gaussian or geometric we observe convergence to the mean of the distribution. However, when the
sequence is Cauchy the moving average diverges. The reason is that, as shown in Example 4.2.2,
the Cauchy distribution does not have a well defined mean! Intuitively, extreme values have
non-negligible probability under the Cauchy distribution, so from time to time the iid sequence
takes values with very large magnitudes and this prevents the moving average from converging.
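The contrast between the Gaussian and the Cauchy case is easy to reproduce. The following sketch is an assumption-laden illustration, not the code used for Figure 6.2: it samples the standard Cauchy through its inverse cdf, and the seed and sequence length are arbitrary.

```python
import math
import random

def running_average(samples):
    """Return the list A(1), A(2), ... of running averages of the samples."""
    averages, total = [], 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        averages.append(total / i)
    return averages

def standard_cauchy(rng):
    """Draw from a standard Cauchy via its inverse cdf, tan(pi (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(1)  # fixed seed, chosen only for reproducibility
n = 20000
gaussian_avg = running_average(rng.gauss(0.0, 1.0) for _ in range(n))
cauchy_avg = running_average(standard_cauchy(rng) for _ in range(n))

for i in (10, 100, 1000, n):
    print(i, round(gaussian_avg[i - 1], 3), round(cauchy_avg[i - 1], 3))
```

The Gaussian column settles near zero, whereas the Cauchy column typically keeps jumping whenever an extreme sample arrives.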


6.3 Central limit theorem 

In the previous section we established that the moving average of a sequence of iid random
variables converges to the mean of their distribution (as long as the mean is well defined and
the variance is finite). In this section, we characterize the distribution of the average $\widetilde{A}(i)$ as $i$
increases. It turns out that, once it is centered and scaled, $\widetilde{A}$ converges in distribution to a
Gaussian random variable, which is very useful in statistics as we will see later on.

Figure 6.2: Realization of the moving average of an iid standard Gaussian sequence (top), an iid
geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom).

This result, known as the central limit theorem, justifies the use of Gaussian distributions
to model data that are the result of many different independent factors. For example, the
distribution of height or weight of people in a certain population often has a Gaussian shape,
as illustrated by Figure 2.13, because the height and weight of a person depend on many
different factors that are roughly independent. In many signal-processing applications noise is
well modeled as having a Gaussian distribution for the same reason.

Theorem 6.3.1 (Central limit theorem). Let $\widetilde{X}$ be an iid discrete random process with mean
$\mu_{\widetilde{X}} := \mu$ such that the variance of $\widetilde{X}(i)$, $\sigma^2$, is bounded. The random process $\sqrt{n} (\widetilde{A} - \mu)$, which
corresponds to the centered and scaled moving average of $\widetilde{X}$, converges in distribution to a
Gaussian random variable with mean 0 and variance $\sigma^2$.

Proof. The proof of this remarkable result is beyond the scope of these notes. It can be found
in any advanced text on probability theory. However, we would still like to provide some
intuition as to why the theorem holds. Theorem 3.5.2 establishes that the pdf of the sum of two
independent random variables is equal to the convolution of their individual pdfs. The same
holds for discrete random variables: the pmf of the sum is equal to the convolution of the pmfs,
as long as the random variables are independent.

If each of the entries of the iid sequence has pdf $f$, then the pdf of the sum of the first $i$ elements
can be obtained by convolving $f$ with itself $i$ times

    f_{\sum_{j=1}^{i} \widetilde{X}(j)}(x) = (f * f * \cdots * f)(x).    (6.26)

If the sequence has a discrete state and each of the entries has pmf $p$, the pmf of the sum of the
first $i$ elements can be obtained by convolving $p$ with itself $i$ times

    p_{\sum_{j=1}^{i} \widetilde{X}(j)}(x) = (p * p * \cdots * p)(x).    (6.27)

Normalizing by $i$ just results in scaling the result of the convolution, so the pmf or pdf of the
moving average $\widetilde{A}$ is the result of repeated convolutions of a fixed function. These convolutions
have a smoothing effect, which eventually transforms the pmf/pdf into a Gaussian! We show
this numerically in Figure 6.3 for two very different distributions: a uniform distribution and a
very irregular one. Both converge to Gaussian-like shapes after just 3 or 4 convolutions. The
central limit theorem makes this precise, establishing that the shape of the pmf or pdf becomes
Gaussian asymptotically. □
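The smoothing effect of repeated convolution can be reproduced with a few lines of code. The sketch below is only an illustration of (6.27) under stated assumptions: the uniform pmf on {0, ..., 5} and the irregular pmf are hypothetical choices (not the distributions of Figure 6.3), and the pmf of the sum is compared pointwise with a Gaussian pdf that has the same mean and variance.

```python
import math

def convolve(p, q):
    """Discrete convolution of two pmfs given as lists indexed from 0."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out

def pmf_of_sum(p, i):
    """pmf of the sum of i iid variables with pmf p (i - 1 convolutions)."""
    result = list(p)
    for _ in range(i - 1):
        result = convolve(result, p)
    return result

def gaussian_pdf(x, mean, var):
    """pdf of a Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def max_gap_to_gaussian(p, i):
    """Largest pointwise gap between the pmf of the sum and the matching Gaussian pdf."""
    mean = sum(k * pk for k, pk in enumerate(p))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
    s = pmf_of_sum(p, i)
    return max(abs(sk - gaussian_pdf(k, i * mean, i * var)) for k, sk in enumerate(s))

uniform = [1 / 6] * 6                          # uniform pmf on {0, ..., 5}
irregular = [0.5, 0.0, 0.05, 0.0, 0.0, 0.45]   # hypothetical irregular pmf
for i in (1, 2, 4, 8, 16, 32):
    print(i, max_gap_to_gaussian(uniform, i), max_gap_to_gaussian(irregular, i))
```

The sum is deliberately left unnormalized: dividing by i only rescales the horizontal axis, as noted in the proof sketch.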

In statistics the central limit theorem is often invoked to justify treating averages as if they have a
Gaussian distribution. The idea is that for large enough $n$, $\sqrt{n} (\widetilde{A} - \mu)$ is approximately Gaussian
with mean 0 and variance $\sigma^2$, which implies that $\widetilde{A}$ is approximately Gaussian with mean $\mu$ and
variance $\sigma^2 / n$. It's important to remember that we have not established this rigorously. The
rate of convergence will depend on the particular distribution of the entries of the iid sequence.

In practice convergence is usually very fast. Figure 6.4 shows the empirical distribution of the
moving average of an exponential and a geometric iid sequence. In both cases the approximation
obtained by the central limit theorem is very accurate even for an average of 100 samples. The
figure also shows that for a Cauchy iid sequence, the distribution of the moving average does
not become Gaussian, which does not contradict the central limit theorem as the distribution
does not have a well defined mean. To close this section we derive a useful approximation to
the binomial distribution using the central limit theorem.

Figure 6.3: Result of convolving two different distributions with themselves several times. The shapes
quickly become Gaussian-like.

Figure 6.4: Empirical distribution of the moving average of an iid exponential sequence with parameter
$\lambda = 2$ (top), an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence
(bottom). The empirical distribution is computed from $10^4$ samples in all cases. For the two first rows
the estimate provided by the central limit theorem is plotted in red.

Example 6.3.2 (Gaussian approximation to the binomial distribution). Let $X$ have a binomial
distribution with parameters $n$ and $p$, such that $n$ is large. Computing the probability that $X$ is
in a certain interval requires summing its pmf over all the values in that interval. Alternatively,
we can obtain a quick approximation using the fact that for large $n$ the distribution of a binomial
random variable is approximately Gaussian. Indeed, we can write $X$ as the sum of $n$ independent
Bernoulli random variables with parameter $p$,

    X = \sum_{i=1}^{n} B_i.    (6.28)

The mean of $B_i$ is $p$ and its variance is $p(1 - p)$. By the central limit theorem $\frac{1}{n} X$ is
approximately Gaussian with mean $p$ and variance $p(1 - p)/n$. Equivalently, by Lemma 2.5.1, $X$ is
approximately Gaussian with mean $np$ and variance $np(1 - p)$.

Assume that a basketball player makes each shot she takes with probability p = 0.4. If we
assume that each shot is independent, what is the probability that she makes at least 420
shots out of 1000? We can model the shots made as a binomial $X$ with parameters 1000 and
0.4. The exact answer is

    P(X \geq 420) = \sum_{x=420}^{1000} p_X(x)    (6.29)

    = \sum_{x=420}^{1000} \binom{1000}{x} 0.4^x \, 0.6^{(1000 - x)}    (6.30)

    = 10.4 \cdot 10^{-2}.    (6.31)

If we apply the Gaussian approximation, by Lemma 2.5.1 $X$ being at least 420 is the same
as a standard Gaussian $U$ being at least $\frac{420 - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard
deviation of $X$, equal to $np = 400$ and $\sqrt{np(1 - p)} = 15.5$ respectively.

    P(X \geq 420) \approx P\left( \sqrt{np(1 - p)} \, U + np \geq 420 \right)    (6.32)

    = P(U \geq 1.29)    (6.33)

    = 1 - \Phi(1.29)    (6.34)

    = 9.85 \cdot 10^{-2}.    (6.35)

△
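Both numbers in this example can be recomputed directly. The sketch below is an illustration, not the authors' code: the function names are hypothetical, the binomial pmf is evaluated in log space (via the log-gamma function) to avoid overflow, and the Gaussian tail is computed with the complementary error function.

```python
import math

def binomial_pmf(n, p, x):
    """Binomial pmf evaluated in log space to avoid overflow for large n."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        + x * math.log(p) + (n - x) * math.log(1 - p)
    )
    return math.exp(log_pmf)

def binomial_tail(n, p, k):
    """Exact P(X >= k) for a binomial X with parameters n and p."""
    return sum(binomial_pmf(n, p, x) for x in range(k, n + 1))

def standard_gaussian_tail(z):
    """P(U >= z) for a standard Gaussian U, i.e. 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p, k = 1000, 0.4, 420
mu = n * p                            # mean of X
sigma = math.sqrt(n * p * (1 - p))    # standard deviation of X

exact = binomial_tail(n, p, k)
approx = standard_gaussian_tail((k - mu) / sigma)
print(exact, approx)
```

The first value should be close to the exact answer (6.31) and the second to the Gaussian approximation (6.35).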


6.4 Monte Carlo simulation 

Simulation is a powerful tool in probability and statistics. Probabilistic models are often too 
complex for us to derive closed-form solutions of the distribution or expectation of quantities of 
interest, as we do in homework problems. 

As an example, imagine that you set up a probabilistic model to determine the probability of
winning a game of solitaire. If the cards are well shuffled, this probability equals

    P(\text{Win}) = \frac{\text{Number of permutations that lead to a win}}{\text{Total number}}.    (6.36)

The problem is that characterizing which permutations lead to a win is very difficult without
actually playing out the game to see the outcome. Doing this for every possible permutation is
computationally intractable, since there are $52! \approx 8 \cdot 10^{67}$ of them. However, there is a simple way
to approximate the probability of interest: simulating a large number of games and recording
what fraction result in wins. The game of solitaire was precisely what inspired Stanislaw Ulam
to propose simulation-based methods, known as the Monte Carlo method (a code name, inspired
by the Monte Carlo Casino in Monaco), in the context of nuclear-weapon research in the 1940s:

The first thoughts and attempts I made to practice (the Monte Carlo Method) were suggested
by a question which occurred to me in 1946 as I was convalescing from an illness and playing
solitaires. The question was what are the chances that a Canfield solitaire laid out with 52
cards will come out successfully? After spending a lot of time trying to estimate them by pure
combinatorial calculations, I wondered whether a more practical method than "abstract thinking"
might not be to lay it out say one hundred times and simply observe and count the number of
successful plays.

This was already possible to envisage with the beginning of the new era of fast computers, and
I immediately thought of problems of neutron diffusion and other questions of mathematical
physics, and more generally how to change processes described by certain differential equations
into an equivalent form interpretable as a succession of random operations. Later, I described
the idea to John von Neumann, and we began to plan actual calculations. 1

Monte Carlo methods use simulation to estimate quantities that are challenging to compute
exactly. In this section, we consider the problem of approximating the probability of an event
$\mathcal{E}$, as in the game of solitaire example.

Algorithm 6.4.1 (Monte Carlo approximation). To approximate the probability of an event $\mathcal{E}$,
we:

1. Generate $n$ independent samples from the indicator function $1_{\mathcal{E}}$ associated to the event:
   $I_1, I_2, \ldots, I_n$.

2. Compute the average of the $n$ samples

    \widetilde{A}(n) := \frac{1}{n} \sum_{i=1}^{n} I_i    (6.37)

   which is the estimate for the probability of $\mathcal{E}$.

The probability of interest can be interpreted as the expectation of the indicator function $1_{\mathcal{E}}$
associated to the event,

    E(1_{\mathcal{E}}) = P(\mathcal{E}).    (6.38)

By the law of large numbers, the estimate $\widetilde{A}$ converges to the true probability as $n \to \infty$. The
following example illustrates the power of this simple technique.
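Before turning to that example, Algorithm 6.4.1 itself can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the event (two fair dice summing to 7, with exact probability 1/6) is a hypothetical toy example chosen because the answer is known, and the function names and seed are arbitrary.

```python
import random

def monte_carlo_probability(indicator_sampler, n):
    """Average of n independent samples of the indicator of an event, as in (6.37)."""
    return sum(indicator_sampler() for _ in range(n)) / n

rng = random.Random(0)  # fixed seed, only for reproducibility

def two_dice_sum_to_seven():
    """Indicator sample of the (hypothetical) event 'two fair dice sum to 7'."""
    return int(rng.randint(1, 6) + rng.randint(1, 6) == 7)

for n in (10, 1000, 100000):
    print(n, monte_carlo_probability(two_dice_sum_to_seven, n))
```

The estimates should approach 1/6 as n grows, in line with the law of large numbers. Only the ability to sample the indicator is required, not an expression for the probability.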


1 http://en.wikipedia.org/wiki/Monte_Carlo_method#History

    Game outcomes (winner)       Rank
    1-2     1-3     2-3          R_1    R_2    R_3      Probability
     1       1       2            1      2      3          1/6
     1       1       3            1      3      2          1/6
     1       3       2            1      1      1          1/12
     1       3       3            2      3      1          1/12
     2       1       2            2      1      3          1/6
     2       1       3            1      1      1          1/6
     2       3       2            3      1      2          1/12
     2       3       3            3      2      1          1/12

    Probability mass function
    Rank     R_1      R_2      R_3
     1       7/12     1/2      5/12
     2       1/4      1/4      1/4
     3       1/6      1/4      1/3

Table 6.1: The table on the left shows all possible outcomes in a league of three teams (m = 3), the
resulting ranks for each team and the corresponding probability. The table on the right shows the pmf
of the ranks of each of the teams.


Example 6.4.2 (Basketball league). In an intramural basketball league $m$ teams play each other
once every season. The teams are ordered according to their past results: team 1 being the best
and team $m$ the worst. We model the probability that team $j$ beats team $i$, for $1 \leq i < j \leq m$,
as

    P(\text{team } j \text{ beats team } i) := \frac{1}{j - i + 1}.    (6.39)


The best team beats the second with probability 1/2 and the third with probability 2/3, the 
second beats the third with probability 1/2, the fourth with probability 2/3 and the fifth with 
probability 3/4, and so on. We assume that the outcomes of the different games are independent. 

At the end of the season, after every team has played with every other team, the teams are
ranked according to their number of wins. If several teams have the same number of wins, then
they share the same rank. For example, if two teams have the most wins, they both have rank
1, and the next team has rank 3. The goal is to compute the distribution of the final rank of
each team in the league, which we model as the random variables $R_1, R_2, \ldots, R_m$. We have all
the information to compute the joint pmf of these random variables by applying the law of total
probability. As shown in Table 6.1 for m = 3, all we need to do is enumerate all the possible
outcomes of the games and sum the probabilities of the outcomes that result in a particular
rank.
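The enumeration described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are hypothetical, exact fractions are used so that the output can be compared with Table 6.1, and the rank of a team is taken to be one plus the number of teams with strictly more wins, which matches the tie rule stated in the text.

```python
from fractions import Fraction
from itertools import product

def win_probability(i, j):
    """P(team i beats team j) under (6.39); teams are numbered from 1."""
    if i < j:
        return 1 - Fraction(1, j - i + 1)
    return Fraction(1, i - j + 1)

def ranks_from_wins(wins):
    """Rank = 1 + number of teams with strictly more wins (ties share a rank)."""
    return [1 + sum(w > own for w in wins) for own in wins]

def exact_rank_pmf(m):
    """pmf[t][r] = P(team t+1 finishes with rank r+1), by full enumeration."""
    games = [(i, j) for i in range(1, m + 1) for j in range(i + 1, m + 1)]
    pmf = [[Fraction(0)] * m for _ in range(m)]
    for outcome in product((0, 1), repeat=len(games)):
        prob, wins = Fraction(1), [0] * m
        for (i, j), upset in zip(games, outcome):
            # upset == 1 means the lower-ranked team j wins the game.
            winner, loser = (j, i) if upset else (i, j)
            prob *= win_probability(winner, loser)
            wins[winner - 1] += 1
        for team, rank in enumerate(ranks_from_wins(wins)):
            pmf[team][rank - 1] += prob
    return pmf

for team, row in enumerate(exact_rank_pmf(3), start=1):
    print(f"R_{team}:", [str(q) for q in row])
```

For m = 3 the rows agree with the pmf in Table 6.1. The loop visits every one of the $2^{m(m-1)/2}$ outcomes, so this approach is only practical for very small leagues.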

Unfortunately, the number of possible outcomes grows dramatically with $m$. The number of
games equals $m(m - 1)/2$, so the number of possible outcomes is $2^{m(m-1)/2}$. When there are just
10 teams, this is larger than $10^{13}$. Computing the exact distribution of the final ranks for
leagues that are not very small is therefore very computationally demanding. Fortunately,
Algorithm 6.4.1 offers a more tractable alternative: We can sample a large number of seasons $n$ by
simulating each game as a Bernoulli random variable with a parameter given by equation (6.39)
and approximate the pmf using the fraction of times that each team ends up in each position.
Simulating a whole season only requires sampling $m(m - 1)/2$ games, which can be done very
fast.

Table 6.2 illustrates the Monte Carlo approach for m = 3. The approximation is quite coarse if
we only use n = 10 simulated seasons, but becomes very accurate when n = 2,000. Figure 6.5
shows the running time needed to compute the exact pmf and to approximate it with the Monte
Carlo approach for different numbers of teams. When the number of teams is very small the
exact computation is very fast, but the running time increases exponentially with m as expected,
so that for 7 teams the computation already takes 5 and a half minutes. In contrast, the Monte
Carlo approximation is dramatically faster. For m = 20 it just takes half a second. Figure 6.6
shows the approximate pmf of the final ranks for 5, 20 and 100 teams. Higher ranks have higher
probabilities because when two teams are tied they are awarded the higher rank.

    Estimated pmf (n = 10)
    Rank     R_1              R_2              R_3
     1       0.6 (0.583)      0.7 (0.5)        0.3 (0.417)
     2       0.1 (0.25)       0.2 (0.25)       0.4 (0.25)
     3       0.3 (0.167)      0.1 (0.25)       0.3 (0.333)

    Estimated pmf (n = 2,000)
    Rank     R_1              R_2              R_3
     1       0.582 (0.583)    0.496 (0.5)      0.417 (0.417)
     2       0.248 (0.25)     0.261 (0.25)     0.244 (0.25)
     3       0.171 (0.167)    0.245 (0.25)     0.339 (0.333)

Table 6.2: The table on the left shows 10 simulated outcomes of a league of three teams (m = 3) and
the resulting ranks. The tables on the right show the estimated pmf obtained by Monte Carlo simulation
from the simulated outcomes on the left (top) and from 2,000 simulated outcomes (bottom). The exact
values are included in brackets for comparison.

Figure 6.5: The graph on the left shows the time needed to obtain the exact pmf of the final ranks
in Example 6.4.2 and to approximate them by Monte Carlo approximation using 2,000 simulated league
outcomes. The table on the right shows the average error per entry of the Monte Carlo approximation.

Figure 6.6: Approximate pmf of the final ranks in Example 6.4.2 using 2,000 simulated league outcomes.
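The Monte Carlo approach of Example 6.4.2 can be sketched as follows. This is a minimal illustration and not the code behind Table 6.2 or Figures 6.5 and 6.6: the function names and seeds are assumptions, and ranks are computed as one plus the number of teams with strictly more wins.

```python
import random

def simulate_season(m, rng):
    """Simulate every game once and return the final rank of each team."""
    wins = [0] * m
    for i in range(1, m + 1):
        for j in range(i + 1, m + 1):
            # Under (6.39) the lower-ranked team j beats team i with prob. 1/(j-i+1).
            winner = j if rng.random() < 1 / (j - i + 1) else i
            wins[winner - 1] += 1
    return [1 + sum(w > own for w in wins) for own in wins]

def estimate_rank_pmf(m, n, seed=0):
    """Estimate pmf[t][r] = P(team t+1 has rank r+1) from n simulated seasons."""
    rng = random.Random(seed)
    counts = [[0] * m for _ in range(m)]
    for _ in range(n):
        for team, rank in enumerate(simulate_season(m, rng)):
            counts[team][rank - 1] += 1
    return [[c / n for c in row] for row in counts]

estimate = estimate_rank_pmf(m=3, n=2000)
for team, row in enumerate(estimate, start=1):
    print(f"R_{team}:", [round(q, 3) for q in row])
```

Each simulated season costs only m(m - 1)/2 random draws, so the same function can be called with much larger m, where exact enumeration is out of reach.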

Chapter 7

Markov Chains 


The Markov property is satisfied by any random process for which the future is conditionally 
independent from the past given the present. 

Definition 7.0.1 (Markov property). A random process satisfies the Markov property if $\widetilde{X}(t_{i+1})$
is conditionally independent of $\widetilde{X}(t_1), \ldots, \widetilde{X}(t_{i-1})$ given $\widetilde{X}(t_i)$ for any $t_1 < t_2 < \cdots < t_i < t_{i+1}$.
If the state space of the random process is discrete, then for any $x_1, x_2, \ldots, x_{i+1}$

    p_{\widetilde{X}(t_{i+1}) | \widetilde{X}(t_1), \widetilde{X}(t_2), \ldots, \widetilde{X}(t_i)}(x_{i+1} | x_1, x_2, \ldots, x_i) = p_{\widetilde{X}(t_{i+1}) | \widetilde{X}(t_i)}(x_{i+1} | x_i).    (7.1)

If the state space of the random process is continuous (and the distribution has a joint pdf),

    f_{\widetilde{X}(t_{i+1}) | \widetilde{X}(t_1), \widetilde{X}(t_2), \ldots, \widetilde{X}(t_i)}(x_{i+1} | x_1, x_2, \ldots, x_i) = f_{\widetilde{X}(t_{i+1}) | \widetilde{X}(t_i)}(x_{i+1} | x_i).    (7.2)

Figure 7.1 shows the directed graphical model that corresponds to the dependence assumptions
implied by the Markov property. Any iid sequence satisfies the Markov property, since all
conditional pmfs or pdfs are just equal to the marginals (in this case there would be no edges
in the directed acyclic graph of Figure 7.1). The random walk also satisfies the property, since
once we fix where the walk is at a certain time $i$ the path that it took before $i$ has no influence
on its next steps.

Lemma 7.0.2. The random walk satisfies the Markov property. 

Proof. Let $\widetilde{X}$ denote the random walk defined in Section 5.6. Conditioned on $\widetilde{X}(j) = x_j$ for
$j \leq i$, $\widetilde{X}(i + 1)$ equals $x_i + \widetilde{S}(i + 1)$. This does not depend on $x_1, \ldots, x_{i-1}$, which implies (7.1). □

7.1 Time-homogeneous discrete-time Markov chains 

A Markov chain is a random process that satisfies the Markov property. Here we consider
discrete-time Markov chains with a finite state space, which means that the process can
only take a finite number of values at any given time point. To specify a Markov chain, we only
need to define the pmf of the random process at its starting point (which we will assume is at
i = 0) and its transition probabilities. This follows from the Markov property, since for any
$n \geq 0$

    p_{\widetilde{X}(0), \widetilde{X}(1), \ldots, \widetilde{X}(n)}(x_0, x_1, \ldots, x_n) := \prod_{i=0}^{n} p_{\widetilde{X}(i) | \widetilde{X}(0), \ldots, \widetilde{X}(i-1)}(x_i | x_0, \ldots, x_{i-1})    (7.3)

    = \prod_{i=0}^{n} p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_i | x_{i-1}),    (7.4)

where the factor for i = 0 is understood to be the marginal pmf $p_{\widetilde{X}(0)}(x_0)$.

Figure 7.1: Directed graphical model describing the dependence assumptions implied by the Markov
property.
i=0 

If these transition probabilities are the same at every time step (i.e. they are constant and do
not depend on $i$), then the Markov chain is said to be time homogeneous. In this case, we can
store the probability of each possible transition in an $s \times s$ matrix $T_{\widetilde{X}}$, where $s$ is the number of
states.

    \left( T_{\widetilde{X}} \right)_{jk} := p_{\widetilde{X}(i+1) | \widetilde{X}(i)}(x_j | x_k).    (7.5)

In this chapter we focus on time-homogeneous finite-state Markov chains. The transition prob¬ 
abilities of these chains can be visualized using a state diagram, which shows each state and the 
probability of every possible transition. See Figure 7.2 below for an example. The state diagram 
should not be confused with the directed acyclic graph (DAG) that represents the dependence 
structure of the model, illustrated in Figure 7.1. In the state diagram, each node corresponds to 
a state and the edges to transition probabilities between states, whereas the DAG just indicates 
the dependence structure of the random process in time and is usually the same for all Markov 
chains. 

To simplify notation we define an $s$-dimensional vector $\vec{p}_{\widetilde{X}(i)}$ called the state vector, which
contains the marginal pmf of the Markov chain at each time $i$,

    \vec{p}_{\widetilde{X}(i)} := \begin{bmatrix} p_{\widetilde{X}(i)}(x_1) \\ p_{\widetilde{X}(i)}(x_2) \\ \cdots \\ p_{\widetilde{X}(i)}(x_s) \end{bmatrix}.    (7.6)

Each entry in the state vector contains the probability that the Markov chain is in that particular
state at time $i$. It is not the value of the Markov chain, which is a random variable.

The initial state vector $\vec{p}_{\widetilde{X}(0)}$ and the transition matrix $T_{\widetilde{X}}$ suffice to completely specify a
time-homogeneous finite-state Markov chain. Indeed, we can compute the joint distribution of the
chain at any $n$ time points $i_1, i_2, \ldots, i_n$ for any $n \geq 1$ from $\vec{p}_{\widetilde{X}(0)}$ and $T_{\widetilde{X}}$ by applying (7.4) and
marginalizing over any times that we are not interested in. We illustrate this in the following
example.

Figure 7.2: State diagram of the Markov chain described in Example 7.1.1 (top). Each arrow shows
the probability of a transition between the two states. Below we show three realizations of the Markov
chain (horizontal axis: customer).

Example 7.1.1 (Car rental). A car-rental company hires you to model the location of their
cars. The company operates in Los Angeles, San Francisco and San Jose. Customers regularly
take a car in a city and drop it off in another. It would be very useful for the company to be
able to compute how likely it is for a car to end up in a given city. You decide to model the
location of the car as a Markov chain, where each time step corresponds to a new customer
taking the car. The company allocates new cars evenly between the three cities. The transition
probabilities, obtained from past data, are given by

        San Francisco    Los Angeles    San Jose
      (     0.6              0.1           0.3    )    San Francisco
      (     0.2              0.8           0.3    )    Los Angeles
      (     0.2              0.1           0.4    )    San Jose

Each column corresponds to the city where the car is picked up and each row to the city where
it is dropped off. To be clear, the probability that a customer moves the car from San Francisco
to LA is 0.2, the probability that the car stays in San Francisco is 0.6, and so on.

The initial state vector and the transition matrix of the Markov chain are

    \vec{p}_{\widetilde{X}(0)} := \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix},    T_{\widetilde{X}} := \begin{bmatrix} 0.6 & 0.1 & 0.3 \\ 0.2 & 0.8 & 0.3 \\ 0.2 & 0.1 & 0.4 \end{bmatrix}.    (7.7)

State 1 is assigned to San Francisco, state 2 to Los Angeles and state 3 to San Jose. Figure 7.2
shows a state diagram of the Markov chain, together with some realizations of the chain.

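The company wants to find out the probability that the car starts in San Francisco, but is in
San Jose right after the second customer. This is given by

    p_{\widetilde{X}(0), \widetilde{X}(2)}(1, 3) = \sum_{i=1}^{3} p_{\widetilde{X}(0), \widetilde{X}(1), \widetilde{X}(2)}(1, i, 3)    (7.8)

    = \sum_{i=1}^{3} p_{\widetilde{X}(0)}(1) \, p_{\widetilde{X}(1) | \widetilde{X}(0)}(i | 1) \, p_{\widetilde{X}(2) | \widetilde{X}(1)}(3 | i)    (7.9)

    = p_{\widetilde{X}(0)}(1) \sum_{i=1}^{3} \left( T_{\widetilde{X}} \right)_{i1} \left( T_{\widetilde{X}} \right)_{3i}    (7.10)

    = \frac{0.6 \cdot 0.2 + 0.2 \cdot 0.1 + 0.2 \cdot 0.4}{3} \approx 7.33 \cdot 10^{-2}.    (7.11)

The probability is 7.33%.

△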
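The marginalization in (7.8) to (7.11) translates directly into code. The sketch below is illustrative: the function name is hypothetical and states are indexed from 0 (0 for San Francisco, 1 for Los Angeles, 2 for San Jose), so the matrix entry $(T_{\widetilde{X}})_{jk}$ is stored as `T[j-1][k-1]`.

```python
# Transition matrix of Example 7.1.1: T[j][k] = P(next state j | current state k).
T = [[0.6, 0.1, 0.3],
     [0.2, 0.8, 0.3],
     [0.2, 0.1, 0.4]]
p0 = [1 / 3, 1 / 3, 1 / 3]  # initial state vector

def joint_start_and_two_steps(T, p0, start, end):
    """P(X(0) = start, X(2) = end), summing over the state at time 1."""
    return p0[start] * sum(T[mid][start] * T[end][mid] for mid in range(len(T)))

prob = joint_start_and_two_steps(T, p0, start=0, end=2)
print(round(prob, 4))  # 0.0733
```

The sum over `mid` is the marginalization over the location after the first customer, which is the time point the company is not interested in.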


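The following lemma provides a simple expression for the state vector at time $i$, $\vec{p}_{\widetilde{X}(i)}$, in terms
of $T_{\widetilde{X}}$ and the previous state vector.

Lemma 7.1.2 (State vector and transition matrix). For a Markov chain $\widetilde{X}$ with transition
matrix $T_{\widetilde{X}}$

    \vec{p}_{\widetilde{X}(i)} = T_{\widetilde{X}} \, \vec{p}_{\widetilde{X}(i-1)}.    (7.12)

If the Markov chain starts at time 0 then

    \vec{p}_{\widetilde{X}(i)} = T_{\widetilde{X}}^{\,i} \, \vec{p}_{\widetilde{X}(0)},    (7.13)

where $T_{\widetilde{X}}^{\,i}$ denotes multiplying $i$ times by matrix $T_{\widetilde{X}}$.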
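
Proof. The proof follows directly from the definitions,

    \vec{p}_{\widetilde{X}(i)} := \begin{bmatrix} p_{\widetilde{X}(i)}(x_1) \\ p_{\widetilde{X}(i)}(x_2) \\ \cdots \\ p_{\widetilde{X}(i)}(x_s) \end{bmatrix}
    = \begin{bmatrix} \sum_{j=1}^{s} p_{\widetilde{X}(i-1)}(x_j) \, p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_1 | x_j) \\ \sum_{j=1}^{s} p_{\widetilde{X}(i-1)}(x_j) \, p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_2 | x_j) \\ \cdots \\ \sum_{j=1}^{s} p_{\widetilde{X}(i-1)}(x_j) \, p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_s | x_j) \end{bmatrix}    (7.14)

    = \begin{bmatrix}
        p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_1 | x_1) & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_1 | x_2) & \cdots & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_1 | x_s) \\
        p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_2 | x_1) & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_2 | x_2) & \cdots & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_2 | x_s) \\
        \cdots & \cdots & \cdots & \cdots \\
        p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_s | x_1) & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_s | x_2) & \cdots & p_{\widetilde{X}(i) | \widetilde{X}(i-1)}(x_s | x_s)
      \end{bmatrix}
      \begin{bmatrix} p_{\widetilde{X}(i-1)}(x_1) \\ p_{\widetilde{X}(i-1)}(x_2) \\ \cdots \\ p_{\widetilde{X}(i-1)}(x_s) \end{bmatrix}
    = T_{\widetilde{X}} \, \vec{p}_{\widetilde{X}(i-1)}.    (7.15)

Equation (7.13) is obtained by applying (7.12) $i$ times and taking into account the Markov
property. □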

Example 7.1.3 (Car rental (continued)). The company wants to estimate the distribution of
locations right after the 5th customer has used a car. Applying Lemma 7.1.2 we obtain

    \vec{p}_{\widetilde{X}(5)} = T_{\widetilde{X}}^{\,5} \, \vec{p}_{\widetilde{X}(0)}    (7.16)

    = \begin{bmatrix} 0.281 \\ 0.534 \\ 0.185 \end{bmatrix}.    (7.17)

The model estimates that after 5 customers more than half of the cars are in Los Angeles.

△
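Lemma 7.1.2 amounts to repeated matrix-vector multiplication, which can be sketched without any external library. The helper names below are hypothetical, and states are indexed from 0 (San Francisco, Los Angeles, San Jose).

```python
# Transition matrix and initial state vector of Example 7.1.1.
T = [[0.6, 0.1, 0.3],
     [0.2, 0.8, 0.3],
     [0.2, 0.1, 0.4]]
p0 = [1 / 3, 1 / 3, 1 / 3]

def mat_vec(T, v):
    """Matrix-vector product T v for a square matrix stored as a list of rows."""
    return [sum(T[j][k] * v[k] for k in range(len(v))) for j in range(len(T))]

def state_vector(T, p0, i):
    """State vector at time i, obtained by applying (7.12) i times."""
    p = list(p0)
    for _ in range(i):
        p = mat_vec(T, p)
    return p

p5 = state_vector(T, p0, 5)
print([round(x, 3) for x in p5])  # [0.281, 0.534, 0.185]
```

Applying (7.12) step by step avoids forming the matrix power in (7.13) explicitly, which is slightly cheaper when only one initial vector is of interest.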


7.2 Recurrence 

The states of a Markov chain can be classified depending on whether the Markov chain is 
guaranteed to always return to them or whether it may eventually stop visiting those states. 

Definition 7.2.1 (Recurrent and transient states). Let $\widetilde{X}$ be a time-homogeneous finite-state
Markov chain. We consider a particular state $x$. If

    P\left( \widetilde{X}(j) = x \text{ for some } j > i \mid \widetilde{X}(i) = x \right) = 1    (7.18)

then the state is recurrent. In words, given that the Markov chain is at $x$, the probability that
it returns to $x$ is one. In contrast, if

    P\left( \widetilde{X}(j) \neq x \text{ for all } j > i \mid \widetilde{X}(i) = x \right) > 0    (7.19)

the state is transient. Given that the Markov chain is at $x$, there is nonzero probability that it
will never return.

Figure 7.3: State diagram of the Markov chain described in Example 7.2.2 (top). Below we show
three realizations of the Markov chain.

The following example illustrates the difference between recurrent and transient states.

Example 7.2.2 (Employment dynamics). A researcher is interested in modeling the employment
dynamics of young people using a Markov chain.

She determines that at age 18 a person is either a student with probability 0.9 or an intern with 
probability 0.1. After that she estimates the following transition probabilities: 


(Each column corresponds to the current state and each row to the next state.)

              Student   Intern   Employed   Unemployed
Student         0.8       0.5       0           0
Intern          0.1       0.5       0           0
Employed        0.1       0         0.9         0.4
Unemployed      0         0         0.1         0.6

The Markov assumption is obviously not completely precise: someone who has been a student
for longer is probably less likely to remain a student. However, such Markov models are easier
to fit (we only need to estimate the transition probabilities) and often yield useful insights.

The initial state vector and the transition matrix of the Markov chain are 


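\[
\vec{p}_{X(0)} := \begin{bmatrix} 0.9 \\ 0.1 \\ 0 \\ 0 \end{bmatrix},
\qquad
T_X := \begin{bmatrix}
0.8 & 0.5 & 0 & 0 \\
0.1 & 0.5 & 0 & 0 \\
0.1 & 0 & 0.9 & 0.4 \\
0 & 0 & 0.1 & 0.6
\end{bmatrix}.
\tag{7.20}
\]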


Figure 7.3 shows the state diagram and some realizations of the Markov chain. 
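
As a rough numerical illustration (not part of the original notes), the state vector of this chain can be propagated with Lemma 7.1.2. The sketch below assumes NumPy is available; the function name `state_vector` is hypothetical.

```python
import numpy as np

# Transition matrix from (7.20); column j holds the probabilities of leaving state j.
T = np.array([
    [0.8, 0.5, 0.0, 0.0],   # to student
    [0.1, 0.5, 0.0, 0.0],   # to intern
    [0.1, 0.0, 0.9, 0.4],   # to employed
    [0.0, 0.0, 0.1, 0.6],   # to unemployed
])
p0 = np.array([0.9, 0.1, 0.0, 0.0])   # initial state vector from (7.20)


def state_vector(T, p0, i):
    """Return the state vector after i steps, T^i p0 (Lemma 7.1.2)."""
    return np.linalg.matrix_power(T, i) @ p0


for i in (1, 5, 50):
    print(i, np.round(state_vector(T, p0, i), 3))
```

The probability mass in the student and intern states shrinks as i grows, which is in line with the discussion of transient states that follows.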

States 1 (student) and 2 (intern) are transient states. Note that the probability that the Markov
chain returns to those states after visiting state 3 (employed) is zero, so

\[ P\left( X(j) \neq 1 \text{ for all } j > i \mid X(i) = 1 \right) \geq P\left( X(i+1) = 3 \mid X(i) = 1 \right) \tag{7.21} \]
\[ = 0.1 > 0, \tag{7.22} \]
\[ P\left( X(j) \neq 2 \text{ for all } j > i \mid X(i) = 2 \right) \geq P\left( X(i+2) = 3 \mid X(i) = 2 \right) \tag{7.23} \]
\[ = 0.5 \cdot 0.1 > 0. \tag{7.24} \]


In contrast, states 3 and 4 (unemployed) are recurrent. We prove this for state 3 (the argument
for state 4 is exactly the same):

\[ P\left( X(j) \neq 3 \text{ for all } j > i \mid X(i) = 3 \right) \tag{7.25} \]
\[ = P\left( X(j) = 4 \text{ for all } j > i \mid X(i) = 3 \right) \tag{7.26} \]
\[ = \lim_{k \to \infty} P\left( X(i+1) = 4 \mid X(i) = 3 \right) \prod_{j=1}^{k} P\left( X(i+j+1) = 4 \mid X(i+j) = 4 \right) \tag{7.27} \]
\[ = \lim_{k \to \infty} 0.1 \cdot 0.6^{k} \tag{7.28} \]
\[ = 0. \tag{7.29} \]

△

In this example, it is not possible to reach the states student and intern from the states employed 
or unemployed. Markov chains for which there is a possible transition between any two states 
(even if it is not direct) are called irreducible. 

Definition 7.2.3 (Irreducible Markov chain). A time-homogeneous finite-state Markov chain
is irreducible if for any state x, the probability of reaching every other state y ≠ x in a finite
number of steps is nonzero, i.e. there exists m > 0 such that

\[ P\left( X(i+m) = y \mid X(i) = x \right) > 0. \tag{7.30} \]
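
The condition in Definition 7.2.3 can be checked numerically for small chains. The following is a minimal sketch, assuming NumPy and a column-stochastic transition matrix as in (7.20); the helper name `is_irreducible` is hypothetical.

```python
import numpy as np


def is_irreducible(T):
    """Check whether every state can be reached from every other state.

    T is column-stochastic: T[k, j] = P(X(i+1) = k | X(i) = j).
    """
    s = T.shape[0]
    A = (np.asarray(T) > 0).astype(int)      # one-step reachability
    # (I + A)^(s-1) has a positive (k, j) entry exactly when k can be reached
    # from j in at most s - 1 steps, which is enough for a chain with s states.
    R = np.linalg.matrix_power(np.eye(s, dtype=int) + A, s - 1)
    return bool(np.all(R > 0))


# Employment chain from (7.20): states 1 and 2 cannot be reached from 3 or 4.
T_employment = np.array([
    [0.8, 0.5, 0.0, 0.0],
    [0.1, 0.5, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.4],
    [0.0, 0.0, 0.1, 0.6],
])
# A made-up three-state cycle 0 -> 1 -> 2 -> 0, used only for illustration.
T_cycle = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

print(is_irreducible(T_employment), is_irreducible(T_cycle))
```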

One can easily check that the Markov chain in Example 7.1.1 is irreducible, whereas the one
in Example 7.2.2 is not. An important result is that all states in an irreducible Markov chain are
recurrent.

Theorem 7.2.4 (Irreducible Markov chains). All states in an irreducible Markov chain are 
recurrent. 

Proof. In any finite-state Markov chain there must be at least one state that is recurrent. If 
all the states are transient there is a nonzero probability that it leaves all of the states forever, 
which is not possible. Without loss of generality let us assume that state x is recurrent. We now 
provide a sketch of a proof that another arbitrary state y must also be recurrent. To alleviate 
notation let 

\[ p_{x,x} := P\left( X(j) = x \text{ for some } j > i \mid X(i) = x \right), \tag{7.31} \]
\[ p_{x,y} := P\left( X(j) = y \text{ for some } j > i \mid X(i) = x \right), \tag{7.32} \]
\[ p_{y,x} := P\left( X(j) = x \text{ for some } j > i \mid X(i) = y \right). \tag{7.33} \]

The chain is irreducible so there is a nonzero probability $p_m > 0$ of reaching y from x in at most
m steps for some m > 0. The probability that the chain goes from x to y and never goes back
to x is consequently at least $p_m (1 - p_{y,x})$. However, x is recurrent, so this probability must be
zero! Since $p_m > 0$ this implies $p_{y,x} = 1$.

Consider the following event:

1. X goes from y to x.

2. X does not return to y in m steps after reaching x.

3. X eventually reaches x again at a time m' > m.

The probability of this event is equal to $p_{y,x} (1 - p_m)\, p_{x,x} = 1 - p_m$ (recall that x is recurrent so
$p_{x,x} = 1$). Now imagine that steps 2 and 3 repeat k times, i.e. that X fails to go from x to y in
m steps k times. The probability of this event is $p_{y,x} (1 - p_m)^k\, p_{x,x}^k = (1 - p_m)^k$. Taking k → ∞
this is equal to zero for any m, so the probability that X does not eventually return to y must
be zero (this can be made rigorous, but the details are beyond the scope of these notes). □

Figure 7.4: State diagram of a Markov chain where the states have period two.


7.3 Periodicity 

Another important consideration is whether the Markov chain always visits a given state at 
regular intervals. If this is the case, then the state has a period greater than one. 

Definition 7.3.1 (Period of a state). Let X be a time-homogeneous finite-state Markov chain 
and x a state of the Markov chain. The period m of x is the largest integer such that it is only 
possible to return to x in a number of steps that is a multiple of m, i.e. we can only return in 
km steps with nonzero probability where k is a positive integer. 

Figure 7.4 shows a Markov chain where the states have a period equal to two. Aperiodic Markov 
chains do not contain states with periods greater than one. 

Definition 7.3.2 (Aperiodic Markov chain). A time-homogeneous finite-state Markov chain X 
is aperiodic if all states have period equal to one. 

The Markov chains in Examples 7.1.1 and 7.2.2 are both aperiodic. 
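
The period in Definition 7.3.1 is the greatest common divisor of the step counts at which a return to the state is possible. The sketch below computes it for small chains. The function name `period` and the search bound `max_steps` (defaulting to three times the number of states) are assumptions of this illustration, not part of the notes.

```python
from math import gcd

import numpy as np


def period(T, x, max_steps=None):
    """Greatest common divisor of the n <= max_steps with (T^n)[x, x] > 0.

    T is column-stochastic and x is a 0-based state index.
    Returns 0 if no return to x is found within max_steps steps.
    """
    s = T.shape[0]
    if max_steps is None:
        max_steps = 3 * s                    # assumed search bound for small chains
    A = (np.asarray(T) > 0).astype(int)      # one-step reachability
    P = np.eye(s, dtype=int)
    g = 0
    for n in range(1, max_steps + 1):
        P = ((P @ A) > 0).astype(int)        # P[k, j] = 1 iff k is reachable from j in n steps
        if P[x, x]:
            g = gcd(g, n)
    return g


# Made-up two-state chain that alternates deterministically between its states.
T_flip = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
# Employment chain from (7.20).
T_employment = np.array([
    [0.8, 0.5, 0.0, 0.0],
    [0.1, 0.5, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.4],
    [0.0, 0.0, 0.1, 0.6],
])

print([period(T_flip, x) for x in range(2)])
print([period(T_employment, x) for x in range(4)])
```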

7.4 Convergence 

In this section we study under what conditions a finite-state time-homogeneous Markov chain 
X converges in distribution. If a Markov chain converges in distribution, then its state vector 
$\vec{p}_{X(i)}$, which contains the first-order pmf of X, converges to a fixed vector $\vec{p}_{\infty}$,

\[ \vec{p}_{\infty} := \lim_{i \to \infty} \vec{p}_{X(i)}. \tag{7.34} \]

In that case the probability of the Markov chain being in each state eventually tends to a fixed
value (which does not imply that the Markov chain will stay at a given state!).

By Lemma 7.1.2 we can express (7.34) in terms of the initial state vector and the transition
matrix of the Markov chain

\[ \vec{p}_{\infty} = \lim_{i \to \infty} T_X^{i}\, \vec{p}_{X(0)}. \tag{7.35} \]

Computing this limit analytically for a particular $T_X$ and $\vec{p}_{X(0)}$ may seem challenging at first
sight. However, it is often possible to leverage the eigendecomposition of the transition matrix
(if it exists) to find $\vec{p}_{\infty}$. This is illustrated in the following example.

Figure 7.5: State diagram of the Markov chain described in Example 7.4.1 (top). Below we show
three realizations of the Markov chain.


Example 7.4.1 (Mobile phones). A company that makes mobile phones wants to model the 
sales of a new model they have just released. At the moment 90% of the phones are in stock, 
10% have been sold locally and none have been exported. Based on past data, the company 
determines that each day a phone is sold with probability 0.2 and exported with probability 0.1. 
The initial state vector and the transition matrix of the Markov chain are 


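\[
\vec{a} := \begin{bmatrix} 0.9 \\ 0.1 \\ 0 \end{bmatrix},
\qquad
T_X = \begin{bmatrix}
0.7 & 0 & 0 \\
0.2 & 1 & 0 \\
0.1 & 0 & 1
\end{bmatrix}.
\tag{7.36}
\]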


We have used $\vec{a}$ to denote $\vec{p}_{X(0)}$ because later we will consider other possible initial state vectors.
Figure 7.5 shows the state diagram and some realizations of the Markov chain.

The company is interested in the fate of the new model. In particular, it would like to compute 
what fraction of mobile phones will end up exported and what fraction will be sold locally. This
is equivalent to computing

\[ \lim_{i \to \infty} \vec{p}_{X(i)} = \lim_{i \to \infty} T_X^{i}\, \vec{p}_{X(0)} \tag{7.37} \]
\[ = \lim_{i \to \infty} T_X^{i}\, \vec{a}. \tag{7.38} \]

Figure 7.6: Evolution of the state vector of the Markov chain in Example 7.4.1 for different values of
the initial state vector $\vec{p}_{X(0)}$ (one panel per initial state vector; horizontal axis: day).


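The transition matrix $T_X$ has three eigenvectors

\[
\vec{q}_1 := \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},
\qquad
\vec{q}_2 := \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix},
\qquad
\vec{q}_3 := \begin{bmatrix} 0.80 \\ -0.53 \\ -0.27 \end{bmatrix}.
\tag{7.39}
\]

The corresponding eigenvalues are $\lambda_1 := 1$, $\lambda_2 := 1$ and $\lambda_3 := 0.7$. We gather the eigenvectors
and eigenvalues into two matrices

\[
Q := \begin{bmatrix} \vec{q}_1 & \vec{q}_2 & \vec{q}_3 \end{bmatrix},
\qquad
\Lambda := \begin{bmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{bmatrix},
\tag{7.40}
\]

so that the eigendecomposition of $T_X$ is

\[ T_X = Q \Lambda Q^{-1}. \tag{7.41} \]

It will be useful to express the initial state vector $\vec{a}$ in terms of the different eigenvectors. This
is achieved by computing

\[ Q^{-1}\, \vec{p}_{X(0)} = \begin{bmatrix} 0.3 \\ 0.7 \\ 1.122 \end{bmatrix} \tag{7.42} \]

so that

\[ \vec{a} = 0.3\, \vec{q}_1 + 0.7\, \vec{q}_2 + 1.122\, \vec{q}_3. \tag{7.43} \]

We conclude that

\[ \lim_{i \to \infty} T_X^{i}\, \vec{a} = \lim_{i \to \infty} T_X^{i} \left( 0.3\, \vec{q}_1 + 0.7\, \vec{q}_2 + 1.122\, \vec{q}_3 \right) \tag{7.44} \]
\[ = \lim_{i \to \infty} 0.3\, T_X^{i} \vec{q}_1 + 0.7\, T_X^{i} \vec{q}_2 + 1.122\, T_X^{i} \vec{q}_3 \tag{7.45} \]
\[ = \lim_{i \to \infty} 0.3\, \lambda_1^{i}\, \vec{q}_1 + 0.7\, \lambda_2^{i}\, \vec{q}_2 + 1.122\, \lambda_3^{i}\, \vec{q}_3 \tag{7.46} \]
\[ = \lim_{i \to \infty} 0.3\, \vec{q}_1 + 0.7\, \vec{q}_2 + 1.122 \cdot 0.7^{i}\, \vec{q}_3 \tag{7.47} \]
\[ = 0.3\, \vec{q}_1 + 0.7\, \vec{q}_2 \tag{7.48} \]
\[ = \begin{bmatrix} 0 \\ 0.7 \\ 0.3 \end{bmatrix}. \tag{7.49} \]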


This means that eventually the probability that each phone has been sold locally is 0.7 and the 
probability that it has been exported is 0.3. The left graph in Figure 7.6 shows the evolution of 
the state vector. As predicted, it eventually converges to the vector in equation (7.49). 
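
The computation in this example can be reproduced numerically. The sketch below assumes NumPy. It also takes the third eigenvector to be the unit-norm vector proportional to (3, -2, -1), which matches the rounded entries in (7.39) but is an assumption about their exact values.

```python
import numpy as np

T = np.array([
    [0.7, 0.0, 0.0],   # in stock
    [0.2, 1.0, 0.0],   # sold locally
    [0.1, 0.0, 1.0],   # exported
])
a = np.array([0.9, 0.1, 0.0])

# Eigenvectors as columns, in the order used in (7.39).
Q = np.column_stack([
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    np.array([3.0, -2.0, -1.0]) / np.sqrt(14.0),   # approx. (0.80, -0.53, -0.27)
])
lam = np.array([1.0, 1.0, 0.7])

coeffs = np.linalg.solve(Q, a)      # coefficients of a in the eigenvector basis


def state_vector(i):
    """T^i a computed through the eigendecomposition Q diag(lam^i) Q^{-1}."""
    return Q @ (lam**i * coeffs)


# Only the components along the eigenvectors with eigenvalue one survive.
limit = coeffs[0] * Q[:, 0] + coeffs[1] * Q[:, 1]
print(np.round(coeffs, 3), np.round(limit, 3))
```

Working in the eigenvector basis makes the limit easy to read off: the coefficient multiplied by 0.7 to the power i vanishes, and the other two are unchanged.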

In general, because of the special structure of the two eigenvectors with eigenvalues equal to one 
in this example, we have 


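\[
\lim_{i \to \infty} T_X^{i}\, \vec{p}_{X(0)} =
\begin{bmatrix}
0 \\
\left( Q^{-1} \vec{p}_{X(0)} \right)_2 \\
\left( Q^{-1} \vec{p}_{X(0)} \right)_1
\end{bmatrix}.
\tag{7.50}
\]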


This is illustrated in Figure 7.6 where you can see the evolution of the state vector if it is 
initialized to these other two distributions: 


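\[
\vec{b} := \begin{bmatrix} 0.6 \\ 0 \\ 0.4 \end{bmatrix},
\qquad
Q^{-1}\, \vec{b} = \begin{bmatrix} 0.6 \\ 0.4 \\ 0.75 \end{bmatrix},
\tag{7.51}
\]
\[
\vec{c} := \begin{bmatrix} 0.4 \\ 0.5 \\ 0.1 \end{bmatrix},
\qquad
Q^{-1}\, \vec{c} = \begin{bmatrix} 0.23 \\ 0.77 \\ 0.50 \end{bmatrix}.
\tag{7.52}
\]

△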


The transition matrix of the Markov chain in Example 7.4.1 has two eigenvectors with eigenvalue 
equal to one. If we set the initial state vector to equal either of these eigenvectors (note that we 
must make sure to normalize them so that the state vector contains a valid pmf) then 

\[ T_X\, \vec{p}_{X(0)} = \vec{p}_{X(0)}, \tag{7.53} \]

so that

\[ \vec{p}_{X(i)} = T_X^{i}\, \vec{p}_{X(0)} \tag{7.54} \]
\[ = \vec{p}_{X(0)} \tag{7.55} \]

for all i. In particular,

\[ \lim_{i \to \infty} \vec{p}_{X(i)} = \vec{p}_{X(0)}, \tag{7.56} \]

so X converges in distribution to a random variable with pmf $\vec{p}_{X(0)}$. A distribution that satisfies
(7.56) is called a stationary distribution of the Markov chain.

Definition 7.4.2 (Stationary distribution). Let X be a finite-state time-homogeneous Markov 
chain and let $\vec{p}_{\text{stat}}$ be a state vector containing a valid pmf over the possible states of X. If $\vec{p}_{\text{stat}}$
is an eigenvector associated to an eigenvalue equal to one, so that

\[ T_X\, \vec{p}_{\text{stat}} = \vec{p}_{\text{stat}}, \tag{7.57} \]

then the distribution corresponding to $\vec{p}_{\text{stat}}$ is a stationary or steady-state distribution of X.

Establishing whether a distribution is stationary by checking whether (7.57) holds may be
challenging computationally if the state space is very large. We now derive an alternative condition
that implies stationarity. Let us first define reversibility of Markov chains.

Definition 7.4.3 (Reversibility). Let X be a finite-state time-homogeneous Markov chain with
s states and transition matrix $T_X$. Assume that X(i) is distributed according to the state vector
$\vec{p} \in \mathbb{R}^{s}$. If

\[ P\left( X(i) = x_j, X(i+1) = x_k \right) = P\left( X(i) = x_k, X(i+1) = x_j \right), \quad \text{for all } 1 \leq j, k \leq s, \tag{7.58} \]

then we say that X is reversible with respect to $\vec{p}$. This is equivalent to the detailed-balance
condition

\[ (T_X)_{kj}\, \vec{p}_j = (T_X)_{jk}\, \vec{p}_k, \quad \text{for all } 1 \leq j, k \leq s. \tag{7.59} \]

As proved in the following theorem, reversibility implies stationarity, but the converse does not 
hold. A Markov chain is not necessarily reversible with respect to a stationary distribution (and 
often will not be). The detailed-balance condition therefore only provides a sufficient condition 
for stationarity. 

Theorem 7.4.4 (Reversibility implies stationarity). If a time-homogeneous Markov chain X is
reversible with respect to a distribution $p_X$, then $p_X$ is a stationary distribution of X.

Proof. Let $\vec{p}$ be the state vector containing $p_X$. By assumption $T_X$ and $\vec{p}$ satisfy (7.59), so for
$1 \leq j \leq s$

\[ \left( T_X\, \vec{p} \right)_j = \sum_{k=1}^{s} (T_X)_{jk}\, \vec{p}_k \tag{7.60} \]
\[ = \sum_{k=1}^{s} (T_X)_{kj}\, \vec{p}_j \tag{7.61} \]
\[ = \vec{p}_j \sum_{k=1}^{s} (T_X)_{kj} \tag{7.62} \]
\[ = \vec{p}_j. \tag{7.63} \]

The last step follows from the fact that the columns of a valid transition matrix must add to
one (the chain always has to go somewhere). □

Figure 7.7: Evolution of the state vector of the Markov chain in Example 7.4.7.
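
Both conditions are easy to test numerically on small chains. The sketch below assumes NumPy. The two three-state chains are made up for illustration, and the helper names `is_reversible` and `is_stationary` are hypothetical. The second chain shows that stationarity can hold without detailed balance.

```python
import numpy as np


def is_reversible(T, p, tol=1e-12):
    """Detailed balance (7.59): T[k, j] * p[j] == T[j, k] * p[k] for all j, k."""
    flow = T * p                     # flow[k, j] = T[k, j] * p[j]
    return bool(np.allclose(flow, flow.T, atol=tol))


def is_stationary(T, p, tol=1e-12):
    """Stationarity (7.57): T p == p."""
    return bool(np.allclose(T @ p, p, atol=tol))


# Made-up chain that satisfies detailed balance with respect to p_rev.
T_rev = np.array([
    [0.5, 0.25, 0.0],
    [0.5, 0.50, 0.5],
    [0.0, 0.25, 0.5],
])
p_rev = np.array([0.25, 0.5, 0.25])

# Made-up deterministic cycle 0 -> 1 -> 2 -> 0 with the uniform distribution.
T_cyc = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
p_unif = np.ones(3) / 3

print(is_reversible(T_rev, p_rev), is_stationary(T_rev, p_rev))
print(is_reversible(T_cyc, p_unif), is_stationary(T_cyc, p_unif))
```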

In Example 7.4.1 the Markov chain has two stationary distributions. It turns out that this is 
not possible for irreducible Markov chains. 

Theorem 7.4.5. Irreducible Markov chains have a single stationary distribution. 

Proof. This follows from the Perron-Frobenius theorem, which states that the transition matrix
of an irreducible Markov chain has a single eigenvector with eigenvalue equal to one and
nonnegative entries. □

If in addition, the Markov chain is aperiodic, then it is guaranteed to converge in distribution 
to a random variable with its stationary distribution for any initial state vector. Such Markov 
chains are called ergodic. 

Theorem 7.4.6 (Convergence of Markov chains). If a discrete-time time-homogeneous Markov
chain X is irreducible and aperiodic, its state vector converges to the stationary distribution $\vec{p}_{\text{stat}}$
of X for any initial state vector $\vec{p}_{X(0)}$. This implies that X converges in distribution to a random
variable with pmf given by $\vec{p}_{\text{stat}}$.


The proof of this result is beyond the scope of these notes. 


Example 7.4.7 (Car rental (continued)). The Markov chain in the car rental example is
irreducible and aperiodic. We will now check that it indeed converges in distribution. Its transition
matrix has the following eigenvectors

\[
\vec{q}_1 := \begin{bmatrix} 0.273 \\ 0.545 \\ 0.182 \end{bmatrix},
\qquad
\vec{q}_2 := \begin{bmatrix} -0.577 \\ 0.789 \\ -0.211 \end{bmatrix},
\qquad
\vec{q}_3 := \begin{bmatrix} -0.577 \\ -0.211 \\ 0.789 \end{bmatrix}.
\tag{7.64}
\]

The corresponding eigenvalues are $\lambda_1 := 1$, $\lambda_2 := 0.573$ and $\lambda_3 := 0.227$. As predicted by
Theorem 7.4.5 the Markov chain has a single stationary distribution.

For any initial state vector, the component that is collinear with $\vec{q}_1$ will be preserved by the
transitions of the Markov chain, but the other two components will become negligible after
a while. The chain consequently converges in distribution to a random variable with pmf $\vec{q}_1$
(note that $\vec{q}_1$ has been normalized to be a valid pmf), as predicted by Theorem 7.4.6. This is
illustrated in Figure 7.7. No matter how the company allocates the new cars, eventually 27.3%
will end up in San Francisco, 54.5% in LA and 18.2% in San Jose. △

7.5 Markov-chain Monte Carlo 

The convergence of Markov chains to a stationary distribution is very useful for simulating 
random variables. Markov-chain Monte Carlo (MCMC) methods generate samples from a target 
distribution by constructing a Markov chain in such a way that the stationary distribution equals 
the desired distribution. These techniques are of huge importance in modern statistics and in 
particular in Bayesian modeling. In this section we describe one of the most popular MCMC 
methods and illustrate it with a simple example. 

The key challenge in MCMC methods is to design an irreducible aperiodic Markov chain for 
which the target distribution is stationary. The Metropolis-Hastings algorithm uses an auxiliary 
Markov chain to achieve this. 

Algorithm 7.5.1 (Metropolis-Hastings algorithm). We store the pmf $p_X$ of the target
distribution in a vector $\vec{p} \in \mathbb{R}^{s}$, such that

\[ \vec{p}_j := p_X(x_j), \quad 1 \leq j \leq s. \tag{7.65} \]

Let T denote the transition matrix of an irreducible Markov chain with the same state space
$\{x_1, \ldots, x_s\}$ as $\vec{p}$.

Initialize X(0) randomly or to a fixed state, then repeat the following steps for i = 1, 2, 3, ....

1. Generate a candidate random variable C from X(i - 1) by using the transition matrix T,
i.e.

\[ P\left( C = k \mid X(i-1) = j \right) = T_{kj}, \quad 1 \leq j, k \leq s. \tag{7.66} \]

2. Set

\[
X(i) := \begin{cases}
C & \text{with probability } p_{\text{acc}}\left( X(i-1), C \right), \\
X(i-1) & \text{otherwise},
\end{cases}
\tag{7.67}
\]

where the acceptance probability is defined as

\[ p_{\text{acc}}(j, k) := \min\left\{ \frac{T_{jk}\, \vec{p}_k}{T_{kj}\, \vec{p}_j},\ 1 \right\}, \quad 1 \leq j, k \leq s. \tag{7.68} \]


It turns out that this algorithm yields a Markov chain that is reversible with respect to the 
distribution of interest, which ensures that the distribution is stationary. 
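
A minimal finite-state sketch of Algorithm 7.5.1 is given below, assuming NumPy. The names `acceptance_probability`, `metropolis_hastings` and `mh_transition_matrix` are hypothetical, and the target weights and the uniform proposal are made up for illustration. States are indexed from 0. The target vector does not need to be normalized, since only ratios of its entries are used.

```python
import numpy as np


def acceptance_probability(p, T, j, k):
    """p_acc(j, k) from (7.68); only the ratio p[k] / p[j] is needed."""
    return min(T[j, k] * p[k] / (T[k, j] * p[j]), 1.0)


def metropolis_hastings(p, T, n_steps, x0=0, seed=None):
    """Simulate X(0), ..., X(n_steps) over the states 0, ..., s - 1."""
    rng = np.random.default_rng(seed)
    s = len(p)
    x = x0
    chain = [x]
    for _ in range(n_steps):
        c = int(rng.choice(s, p=T[:, x]))        # candidate drawn from column x of T
        if rng.random() < acceptance_probability(p, T, x, c):
            x = c                                # accept the candidate
        chain.append(x)
    return np.array(chain)


def mh_transition_matrix(p, T):
    """Transition matrix of the chain X produced by the algorithm."""
    s = len(p)
    TX = np.zeros((s, s))
    for j in range(s):
        for k in range(s):
            if k != j and T[k, j] > 0:
                TX[k, j] = acceptance_probability(p, T, j, k) * T[k, j]
        TX[j, j] = 1.0 - TX[:, j].sum()          # rejected or self-proposed moves
    return TX


p = np.array([1.0, 2.0, 3.0, 4.0])               # unnormalized target weights (made up)
T = np.full((4, 4), 0.25)                        # uniform proposal chain (made up)
TX = mh_transition_matrix(p, T)
chain = metropolis_hastings(p, T, n_steps=1000, seed=0)
print(np.round(TX @ (p / p.sum()), 3))
```

Building the transition matrix explicitly is only practical for small state spaces, but it allows the stationarity of the target to be checked directly.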

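Theorem 7.5.2. The pmf in $\vec{p}$ corresponds to a stationary distribution of the Markov chain X
obtained by the Metropolis-Hastings algorithm.

Proof. We show that the Markov chain X is reversible with respect to $\vec{p}$, i.e. that

\[ (T_X)_{kj}\, \vec{p}_j = (T_X)_{jk}\, \vec{p}_k \tag{7.69} \]

holds for all $1 \leq j, k \leq s$. This establishes the result by Theorem 7.4.4. The detailed-balance
condition holds trivially if j = k. If j ≠ k we have

\[ (T_X)_{kj} := P\left( X(i) = k \mid X(i-1) = j \right) \tag{7.70} \]
\[ = P\left( X(i) = C, C = k \mid X(i-1) = j \right) \tag{7.71} \]
\[ = P\left( X(i) = C \mid C = k, X(i-1) = j \right) P\left( C = k \mid X(i-1) = j \right) \tag{7.72} \]
\[ = p_{\text{acc}}(j, k)\, T_{kj} \tag{7.73} \]

and by exactly the same argument $(T_X)_{jk} = p_{\text{acc}}(k, j)\, T_{jk}$. We conclude that

\[ (T_X)_{kj}\, \vec{p}_j = p_{\text{acc}}(j, k)\, T_{kj}\, \vec{p}_j \tag{7.74} \]
\[ = T_{kj}\, \vec{p}_j \min\left\{ \frac{T_{jk}\, \vec{p}_k}{T_{kj}\, \vec{p}_j},\ 1 \right\} \tag{7.75} \]
\[ = \min\left\{ T_{jk}\, \vec{p}_k,\ T_{kj}\, \vec{p}_j \right\} \tag{7.76} \]
\[ = T_{jk}\, \vec{p}_k \min\left\{ 1,\ \frac{T_{kj}\, \vec{p}_j}{T_{jk}\, \vec{p}_k} \right\} \tag{7.77} \]
\[ = p_{\text{acc}}(k, j)\, T_{jk}\, \vec{p}_k \tag{7.78} \]
\[ = (T_X)_{jk}\, \vec{p}_k. \tag{7.79} \]

□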


The following example is taken from Hastings’s seminal paper Monte Carlo Sampling Methods 
Using Markov Chains and Their Applications. 

Example 7.5.3 (Generating a Poisson random variable). Our aim is to generate a Poisson
random variable X with parameter λ. Note that we don’t need to know the normalizing constant
in the Poisson pmf, which equals $e^{-\lambda}$, as long as we know that the pmf is proportional to

\[ p_X(x) \propto \frac{\lambda^{x}}{x!}. \tag{7.80} \]

The auxiliary Markov chain must be able to reach any possible value of X, i.e. all nonnegative
integers. We will use a modified random walk that takes steps upwards and downwards with
probability 1/2, but never goes below 0. Its transition matrix equals

\[
T_{kj} := \begin{cases}
\frac{1}{2} & \text{if } j = 0 \text{ and } k = 0, \\
\frac{1}{2} & \text{if } k = j + 1, \\
\frac{1}{2} & \text{if } j > 0 \text{ and } k = j - 1, \\
0 & \text{otherwise}.
\end{cases}
\tag{7.81}
\]

T is symmetric so the acceptance probability is equal to the ratio of the pmfs:

\[ p_{\text{acc}}(j, k) := \min\left\{ \frac{T_{jk}\, p_X(k)}{T_{kj}\, p_X(j)},\ 1 \right\} \tag{7.82} \]
\[ = \min\left\{ \frac{p_X(k)}{p_X(j)},\ 1 \right\}. \tag{7.83} \]

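To compute the acceptance probability, we only consider transitions that are possible under the
random walk. If j = 0 and k = 0

\[ p_{\text{acc}}(j, k) = 1. \tag{7.84} \]

If k = j + 1

\[ p_{\text{acc}}(j, j+1) = \min\left\{ \frac{\lambda^{j+1} / (j+1)!}{\lambda^{j} / j!},\ 1 \right\} \tag{7.85} \]
\[ = \min\left\{ \frac{\lambda}{j+1},\ 1 \right\}. \tag{7.86} \]

If k = j - 1

\[ p_{\text{acc}}(j, j-1) = \min\left\{ \frac{\lambda^{j-1} / (j-1)!}{\lambda^{j} / j!},\ 1 \right\} \tag{7.87} \]
\[ = \min\left\{ \frac{j}{\lambda},\ 1 \right\}. \tag{7.88} \]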

We now spell out the steps of the Metropolis-Hastings method. To simulate the auxiliary random
walk we use a sequence of Bernoulli random variables that indicate whether the random walk
is trying to go up or down (or stay at zero). We initialize the chain at $x_0 = 0$. Then, for
i = 1, 2, ..., we

• Generate a sample b from a Bernoulli distribution with parameter 1/2 and a sample u
uniformly distributed in [0, 1].

• If b = 0:

    - If $x_{i-1} = 0$, $x_i := 0$.

    - If $x_{i-1} > 0$:

        * If $u < x_{i-1} / \lambda$, $x_i := x_{i-1} - 1$.

        * Otherwise $x_i := x_{i-1}$.

• If b = 1:

    - If $u < \lambda / (x_{i-1} + 1)$, $x_i := x_{i-1} + 1$.

    - Otherwise $x_i := x_{i-1}$.

Figure 7.8: Convergence in distribution of the Markov chain constructed in Example 7.5.3 for λ := 6.
To prevent clutter we only plot the empirical distribution of 6 states, computed by running the Markov
chain $10^4$ times.
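
The steps listed in Example 7.5.3 translate almost line by line into code. The sketch below uses only the Python standard library. The function name `poisson_metropolis_hastings`, the seed and the number of discarded initial samples are illustrative choices rather than values from the notes.

```python
import random


def poisson_metropolis_hastings(lam, n_steps, seed=None):
    """Run the chain of Example 7.5.3 and return the states x_0, ..., x_n."""
    rng = random.Random(seed)
    x = 0                               # the chain is initialized at zero
    chain = [x]
    for _ in range(n_steps):
        b = rng.random() < 0.5          # Bernoulli(1/2): True means "try to go up"
        u = rng.random()                # uniform sample for the accept/reject decision
        if b:
            if u < lam / (x + 1):       # accept with probability min{lam / (x + 1), 1}
                x += 1
        elif x > 0:
            if u < x / lam:             # accept with probability min{x / lam, 1}
                x -= 1
        # if b is False and x == 0 the chain stays at zero
        chain.append(x)
    return chain


chain = poisson_metropolis_hastings(lam=6.0, n_steps=50_000, seed=0)
samples = chain[1_000:]                 # drop an initial stretch of the chain
print(sum(samples) / len(samples))      # should be close to lam
```

Note that the code never evaluates the normalizing constant of the Poisson pmf; it only uses the ratios in (7.86) and (7.88).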


The Markov chain that we have built is irreducible: there is nonzero probability of going from
any nonnegative integer to any other nonnegative integer (although it could take a while!). We
have not really proved that the chain should converge to the desired distribution, since we have
not discussed convergence of Markov chains with infinite state spaces, but Figure 7.8 shows that
the method indeed allows us to sample from a Poisson distribution with λ := 6.

△

For the example in Figure 7.8, approximate convergence in distribution occurs after around 100 
iterations. This is called the mixing time of the Markov chain. To account for it, MCMC 
methods usually discard the samples from the chain over an initial period known as burn-in 
time. 

The careful reader might be wondering about the point of using MCMC methods if we already 
have access to the desired distribution. It seems much simpler to just apply the method described 
in Section 2.6.1 instead. However, the Metropolis-Hastings method can be applied to discrete 
distributions with infinite supports and also to continuous distributions (justifying this is beyond 
the scope of these notes). Crucially, in contrast with inverse-transform and rejection sampling, 
Metropolis-Hastings does not require having access to the pmf $p_X$ or pdf $f_X$ of the target
distribution, but rather to the ratio $p_X(x) / p_X(y)$ or $f_X(x) / f_X(y)$ for every $x \neq y$. This is
very useful when computing conditional distributions within probabilistic models.

Imagine that we have access to the marginal distribution of a continuous random variable A and
the conditional distribution of another continuous random variable B given A. Computing the
conditional pdf

\[ f_{A|B}(a|b) = \frac{f_A(a)\, f_{B|A}(b|a)}{\int_{u=-\infty}^{\infty} f_A(u)\, f_{B|A}(b|u)\, \mathrm{d}u} \tag{7.89} \]

is not necessarily feasible due to the integral in the denominator. However, if we apply
Metropolis-Hastings to sample from $f_{A|B}$ we don’t need to compute the normalizing factor since for any
$a_1 \neq a_2$

\[ \frac{f_{A|B}(a_1|b)}{f_{A|B}(a_2|b)} = \frac{f_A(a_1)\, f_{B|A}(b|a_1)}{f_A(a_2)\, f_{B|A}(b|a_2)}. \tag{7.90} \]



Chapter 8 

Descriptive statistics 


In this chapter we describe several techniques for visualizing data, as well as for computing 
quantities that summarize it effectively. Such quantities are known as descriptive statistics. As 
we will see in the following chapters, these statistics can often be interpreted within a probabilistic
framework, but they are also useful when probabilistic assumptions are not warranted.
Because of this, we present them from a deterministic point of view. 

8.1 Histogram 

We begin by considering data sets containing one-dimensional data. One of the most natural 
ways of visualizing 1D data is to plot their histogram. The histogram is obtained by binning the
range of the data and counting the number of instances that fall within each bin. The width of 
the bins is a parameter that can be adjusted to yield higher or lower resolution. If we interpret 
the data as corresponding to samples from a random variable, then the histogram would be a 
piecewise constant approximation to their pmf or pdf. 

Figure 8.1 shows two histograms computed from temperature data gathered at a weather station
in Oxford over 150 years.¹ Each data point represents the maximum temperature recorded in
January or August of a particular year. Figure 8.2 shows a histogram of the GDP per capita of
all countries in the world in 2014 according to the United Nations.²

8.2 Sample mean and variance 

Averaging the elements in a one-dimensional data set provides a one-number summary of the 
data, which is a deterministic counterpart to the mean of a random variable (recall that we are 
making no probabilistic assumptions in this chapter). This can be extended to multi-dimensional 
data by averaging over each dimension separately. 

Definition 8.2.1 (Sample mean). Let $\{x_1, x_2, \ldots, x_n\}$ be a set of real-valued data. The sample
mean of the data is defined as

\[ \operatorname{av}(x_1, x_2, \ldots, x_n) := \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{8.1} \]

Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional real-valued data vectors. The sample mean is

\[ \operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n) := \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i. \tag{8.2} \]

Figure 8.1: Histograms of temperature data taken in a weather station in Oxford over 150 years. Each
data point equals the maximum temperature recorded in a certain month in a particular year.

Figure 8.2: Histogram of the GDP per capita of all countries in the world in 2014 (horizontal axis:
thousands of dollars).

Figure 8.3: Effect of centering a two-dimensional data set (panels: uncentered data, centered data).
The axes are depicted using dashed lines.

¹ The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.

² The data is available at http://unstats.un.org/unsd/snaama/selbasicFast.asp.


The sample mean of the data in Figure 8.1 is 6.73 °C in January and 21.3 °C in August. The 
sample mean of the GDPs per capita in Figure 8.2 is $16,500. 

Geometrically, the average, also known as the sample mean, is the center of mass of the data. A 
common preprocessing step in data analysis is to center a set of data by subtracting its sample 
mean. Figure 8.3 shows an example. 

Algorithm 8.2.2 (Centering). Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional data. To center the
data set we:

1. Compute the sample mean following Definition 8.2.1.

2. Subtract the sample mean from each vector of data. For $1 \leq i \leq n$

\[ \vec{y}_i := \vec{x}_i - \operatorname{av}(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n). \tag{8.3} \]

The resulting data set $\vec{y}_1, \ldots, \vec{y}_n$ has sample mean equal to zero; it is centered at the origin.
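
Algorithm 8.2.2 can be sketched in a few lines, assuming NumPy and data stored with one row per data point. The function name `center` and the small data set are made up for illustration.

```python
import numpy as np


def center(X):
    """Subtract the sample mean from each row of X (n data points, d features)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)            # sample mean of each feature, as in (8.2)
    return X - mean, mean


X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [6.0, 14.0]])          # made-up two-dimensional data
Y, mean = center(X)
print(mean)                          # sample mean of X
print(Y.mean(axis=0))                # sample mean of the centered data
```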

The sample variance is the average of the squared deviations from the sample mean. Geometrically,
it quantifies the average variation of the data set around its center. It is a deterministic
counterpart to the variance of a random variable.

Definition 8.2.3 (Sample variance and standard deviation). Let $\{x_1, x_2, \ldots, x_n\}$ be a set of
real-valued data. The sample variance is defined as

\[ \operatorname{var}(x_1, x_2, \ldots, x_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \operatorname{av}(x_1, x_2, \ldots, x_n) \right)^2. \tag{8.4} \]

The sample standard deviation is the square root of the sample variance

\[ \operatorname{std}(x_1, x_2, \ldots, x_n) := \sqrt{\operatorname{var}(x_1, x_2, \ldots, x_n)}. \tag{8.5} \]


You might be wondering why the normalizing constant is 1/ (n — 1) instead of 1/n. The reason 
is that this ensures that the expectation of the sample variance equals the true variance when 
the data are iid (see Lemma 9.2.5). In practice there is not much difference between the two 
normalizations. 
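
The two normalizations can be compared directly on a small made-up data set. The sketch assumes NumPy; the variable names are illustrative.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up data
n = len(x)

sq_dev = (x - x.mean()) ** 2
var_unbiased = sq_dev.sum() / (n - 1)    # normalization used in (8.4)
var_biased = sq_dev.sum() / n            # alternative 1/n normalization
std = np.sqrt(var_unbiased)              # sample standard deviation (8.5)

print(var_unbiased, var_biased, std)
```

In NumPy the same quantities are returned by `np.var(x, ddof=1)` and `np.std(x, ddof=1)`; the default `ddof=0` corresponds to the 1/n normalization.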

The sample standard deviation of the temperature data in Figure 8.1 is 1.99 °C in January and 
1.73 °C in August. The sample standard deviation of the GDP data in Figure 8.2 is $25,300. 


8.3 Order statistics 

In some cases, a data set is well described by its mean and standard deviation. 

In January the temperature in Oxford is around 6.73 °C give or take 2 °C.

This is a pretty accurate account of the temperature data from the previous section. However,
imagine that someone describes the GDP data set in Figure 8.2 as: 

Countries typically have a GDP per capita of about $16,500 give or take $25,300.

This description is pretty terrible. The problem is that most countries have very small GDPs 
per capita, whereas a few have really large ones and the sample mean and standard deviation 
don’t really convey this information. Order statistics provide an alternative description, which 
is usually more informative when there are extreme values in the data. 

Definition 8.3.1 (Quantiles and percentiles). Let $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ denote the ordered
elements of a set of data $\{x_1, x_2, \ldots, x_n\}$. The q quantile of the data for 0 < q < 1 is $x_{([q(n+1)])}$,
where $[q(n+1)]$ is the result of rounding $q(n+1)$ to the closest integer. The q quantile is
also known as the 100 q percentile.

The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas the 0.5 quantile
is known as the sample median. A quarter of the data are smaller than the 0.25 quantile, half
are smaller (or larger) than the median and three quarters are smaller than the 0.75 quantile. If
n is even, the sample median is usually set to

\[ \frac{x_{(n/2)} + x_{(n/2+1)}}{2}. \tag{8.6} \]

The difference between the third and the first quartile is known as the interquartile range
(IQR).
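
Definition 8.3.1 can be implemented directly with the standard library. In the sketch below the function names are hypothetical and the data are made up. Rounding ties upwards and clipping the rank to the range 1 to n are assumptions, since the definition does not say how to handle those cases. Library routines such as `numpy.quantile` interpolate by default and may return slightly different values.

```python
def quantile(data, q):
    """q quantile following Definition 8.3.1: the element of rank [q (n + 1)]."""
    xs = sorted(data)
    n = len(xs)
    rank = int(q * (n + 1) + 0.5)        # round to the closest integer (ties go up)
    rank = min(max(rank, 1), n)          # keep the rank valid for extreme values of q
    return xs[rank - 1]                  # ranks start at 1, Python indices at 0


def median(data):
    """Sample median, using convention (8.6) when n is even."""
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 0:
        return (xs[n // 2 - 1] + xs[n // 2]) / 2
    return xs[n // 2]


data = [7, 1, 3, 9, 5, 11, 13]           # made-up data
q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
iqr = q3 - q1
print(q1, median(data), q3, iqr)
```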

It turns out that for the temperature data set in Figure 8.1 the sample median is 6.80 °C in
January and 21.2 °C in August, which is essentially the same as the sample mean. The IQR is
2.9 °C in January and 2.1 °C in August. This describes a spread around the median that is very
similar to the spread around the sample mean described by the sample standard deviation. In
this particular example, there does not seem to be an advantage in using order statistics.

Figure 8.4: Box plots of the Oxford temperature data set used in Figure 8.1. Each box plot corresponds
to the maximum temperature in a particular month (January, April, August and November) over the
last 150 years.

For the GDP data set, the median is $6,350. This means that half of the countries have a GDP
per capita of less than $6,350. In contrast, 71% of the countries have a GDP per capita lower than
the sample mean! The IQR of these data is $18,200. To provide a more complete description of the
data set, we can list a five-number summary of order statistics: the minimum $x_{(1)}$, the first
quartile, the sample median, the third quartile and the maximum $x_{(n)}$. For the GDP data set
these are $130, $1,960, $6,350, $20,100, and $188,000 respectively.

We can visualize the main order statistics of a data set by using a box plot, which shows the 
median value of the data enclosed in a box. The bottom and top of the box are the first and 
third quartiles. This way of visualizing a data set was proposed by the mathematician John 
Tukey. Tukey’s box plot also includes whiskers. The lower whisker is a line extending from the 
bottom of the box to the smallest value within 1.5 IQR of the first quartile. The higher whisker 
extends from the top of the box to the highest value within 1.5 IQR of the third quartile. Values 
beyond the whiskers are considered outliers and are plotted separately. 
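
The quantities displayed in Tukey's box plot can be computed as follows. This is a sketch using only the standard library, with quartiles computed as in Definition 8.3.1 (rounding ties upwards is an assumption). The function name `box_plot_stats` and the data are made up for illustration.

```python
def box_plot_stats(data):
    """Quartiles, whisker ends and outliers of a Tukey box plot."""
    xs = sorted(data)
    n = len(xs)

    def quantile(q):
        rank = min(max(int(q * (n + 1) + 0.5), 1), n)   # rank [q (n + 1)], clipped
        return xs[rank - 1]

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    # Whiskers end at the most extreme data points within 1.5 IQR of the box.
    lower = min(x for x in xs if x >= q1 - 1.5 * iqr)
    upper = max(x for x in xs if x <= q3 + 1.5 * iqr)
    outliers = [x for x in xs if x < lower or x > upper]
    return {"q1": q1, "q3": q3, "lower": lower, "upper": upper, "outliers": outliers}


stats = box_plot_stats([1, 3, 5, 7, 9, 11, 13, 40])     # made-up data
print(stats)
```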

Figure 8.4 applies box plots to visualize the temperature data set used in Figure 8.1. Each box 
plot corresponds to the maximum temperature in a particular month (January, April, August 
and November) over the last 150 years. The box plots allow us to quickly compare the spread 
of temperatures in the different months. Figure 8.5 shows a box plot of the GDP data from 
Figure 8.2. From the box plot it is immediately apparent that most countries have very small 
GDPs per capita, that the spread between countries increases for larger GDPs per capita and 
that a small number of countries have very large GDPs per capita. 



























Figure 8.5: Box plot of the GDP per capita of all countries in the world in 2014. Not all of the outliers 
are shown. 


8.4 Sample covariance 

In the previous sections we mostly considered data sets consisting of one-dimensional data 
(except when we discussed the sample mean of a multidimensional data set). In machine-learning
lingo, there was only one feature per data point. We now study a multidimensional
scenario, where there are several features associated to each data point.

If the dimension of the data set equals two (i.e. there are two features per data point), we
can visualize the data using a scatter plot, where each axis represents one of the features. 
Figure 8.6 shows several scatter plots of temperature data. These data are the same as in 
Figure 8.1, but we have now arranged them to form two-dimensional data sets. In the plot on 
the left, one dimension corresponds to the temperature in January and the other dimension to 
the temperature in August (there is one data point per year). In the plot on the right, one 
dimension represents the minimum temperature in a particular month and the other dimension 
represents the maximum temperature in the same month (there is one data point per month). 
The sample covariance quantifies whether the two features of a two-dimensional data set tend 
to vary in a similar way on average, just as the covariance quantifies the expected joint variation 
of two random variables. 

Definition 8.4.1 (Sample covariance). Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a data set where
each example consists of a measurement of two different features. The sample covariance is
defined as

\[
\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \operatorname{av}(x_1, \ldots, x_n)\right) \left(y_i - \operatorname{av}(y_1, \ldots, y_n)\right). \tag{8.7}
\]

In order to take into account that each individual feature may vary on a different scale, a common 
preprocessing step is to normalize each feature, dividing it by its sample standard deviation. 

Figure 8.6: Scatter plots of the temperature in January and in August (left, $\rho = 0.269$) and of the
maximum and minimum monthly temperature (right, $\rho = 0.962$) in Oxford over the last 150 years.


If we normalize before computing the covariance, we obtain the sample correlation coefficient 
of the two features. One of the advantages of the correlation coefficient is that we don’t need 
to worry about the units in which the features are measured. In contrast, measuring a feature 
representing distance in inches or miles can severely distort the covariance, if we don’t scale the 
other feature accordingly. 


Definition 8.4.2 (Sample correlation coefficient). Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a data
set where each example consists of two features. The sample correlation coefficient is defined as

\[
\rho\left((x_1, y_1), \ldots, (x_n, y_n)\right) := \frac{\operatorname{cov}\left((x_1, y_1), \ldots, (x_n, y_n)\right)}{\operatorname{std}(x_1, \ldots, x_n) \, \operatorname{std}(y_1, \ldots, y_n)}. \tag{8.8}
\]


By the Cauchy-Schwarz inequality (Theorem B.2.4), which states that for any vectors $\vec{a}$ and $\vec{b}$

\[
-1 \le \frac{\vec{a}^T \vec{b}}{\|\vec{a}\|_2 \, \|\vec{b}\|_2} \le 1, \tag{8.9}
\]

the magnitude of the sample correlation coefficient is bounded by one. If it is equal to 1 or
-1, then the two centered data sets are collinear. This version of the Cauchy-Schwarz inequality
is related to the one for random variables (Theorem 4.3.7), but here it applies to
deterministic vectors.

Figure 8.6 is annotated with the sample correlation coefficients corresponding to the two plots.
Maximum and minimum temperatures within the same month are highly correlated, whereas
the maximum temperatures in January and August within the same year are only weakly
correlated.
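
The two definitions above can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original notes: the helper names `sample_cov` and `sample_corr` and the synthetic data are assumptions made for illustration. It also shows that rescaling a feature (a change of units) leaves the correlation coefficient unchanged.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance with the 1/(n-1) normalization of (8.7)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def sample_corr(x, y):
    """Sample correlation coefficient (8.8)."""
    return sample_cov(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Hypothetical data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

rho = sample_corr(x, y)
# Changing the units of x scales the covariance by 1000 but does not change rho.
rho_rescaled = sample_corr(1000.0 * x, y)
```

Both helpers agree with NumPy's built-in `np.cov` and `np.corrcoef`, which use the same 1/(n-1) normalization by default.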

8.5 Sample covariance matrix


8.5.1 Definition 


We now consider sets of multidimensional data. In particular, we are interested in analyzing the 
variation in the data. The sample covariance matrix of a data set contains the pairwise sample 
covariance between every pair of features. 


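Definition 8.5.1 (Sample covariance matrix). Let $\{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n\}$ be a set of d-dimensional
real-valued data vectors. The sample covariance matrix of these data is the $d \times d$ matrix

\[
\Sigma(\vec{x}_1, \ldots, \vec{x}_n) := \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right) \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T. \tag{8.10}
\]

The $(i, j)$ entry of the covariance matrix, where $1 \le i, j \le d$, is given by

\[
\Sigma(\vec{x}_1, \ldots, \vec{x}_n)_{ij} =
\begin{cases}
\operatorname{var}\left((\vec{x}_1)_i, \ldots, (\vec{x}_n)_i\right) & \text{if } i = j, \\
\operatorname{cov}\left(((\vec{x}_1)_i, (\vec{x}_1)_j), \ldots, ((\vec{x}_n)_i, (\vec{x}_n)_j)\right) & \text{if } i \neq j.
\end{cases} \tag{8.11}
\]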


In order to characterize the variation of a multidimensional data set around its center, we 
consider its variation in different directions. The average variation of the data in a certain 
direction is quantified by the sample variance of the projections of the data onto that direction. 
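If $\vec{v}$ is a unit-norm vector aligned with a direction of interest, the sample variance of the data
set in the direction of $\vec{v}$ is given by

\begin{align}
\operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \vec{x}_i - \operatorname{av}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right)\right)^2 \tag{8.12} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{v}^T \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)\right)^2 \tag{8.13} \\
&= \vec{v}^T \left( \frac{1}{n-1} \sum_{i=1}^{n} \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right) \left(\vec{x}_i - \operatorname{av}(\vec{x}_1, \ldots, \vec{x}_n)\right)^T \right) \vec{v} \notag \\
&= \vec{v}^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, \vec{v}. \tag{8.14}
\end{align}
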


Using the sample covariance matrix we can express the variation in every direction! This is 
a deterministic analog of the fact that the covariance matrix of a random vector encodes its 
variance in every direction. 
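
The identity between the sample variance of the projections and the quadratic form in the sample covariance matrix can be verified numerically. This is a minimal sketch in Python with NumPy; the synthetic data set and the randomly chosen direction are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data set: n = 500 points in d = 3 dimensions, one point per row.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])

# d x d sample covariance matrix (np.cov uses the 1/(n-1) normalization).
Sigma = np.cov(X, rowvar=False)

v = rng.normal(size=3)
v /= np.linalg.norm(v)                        # unit-norm direction

var_of_projections = np.var(X @ v, ddof=1)    # sample variance of v^T x_1, ..., v^T x_n
quadratic_form = v @ Sigma @ v                # v^T Sigma v
```

The two quantities coincide up to floating-point error for any unit-norm `v`, so the matrix alone is enough to evaluate the variation in any direction without revisiting the data.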


8.5.2 Principal component analysis 


Consider the eigendecomposition of the covariance matrix 


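\[
\Sigma(\vec{x}_1, \ldots, \vec{x}_n) =
\begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}
\begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{bmatrix}
\begin{bmatrix} \vec{u}_1 & \vec{u}_2 & \cdots & \vec{u}_n \end{bmatrix}^T. \tag{8.15}
\]

By definition, $\Sigma(\vec{x}_1, \ldots, \vec{x}_n)$ is symmetric, so its eigenvectors $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_n$ are orthogonal. By
equation (8.14) and Theorem B.7.2, the eigenvectors and eigenvalues completely characterize
the variation of the data in every direction.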

Figure 8.7: PCA of a set consisting of n = 100 two-dimensional data points with different configurations.
Panel annotations, from left to right: $\lambda_1/n = 0.497$, $\lambda_2/n = 0.476$; $\lambda_1/n = 0.967$, $\lambda_2/n = 0.127$;
$\lambda_1/n = 1.820$, $\lambda_2/n = 0.021$.


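Theorem 8.5.2. Let the sample covariance matrix $\Sigma(\vec{x}_1, \ldots, \vec{x}_n)$ of a set of vectors have an
eigendecomposition given by (8.15) where the eigenvalues are ordered $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$. Then,

\begin{align}
\lambda_1 &= \max_{\|\vec{v}\|_2 = 1} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right), \tag{8.16} \\
\vec{u}_1 &= \arg\max_{\|\vec{v}\|_2 = 1} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right), \tag{8.17} \\
\lambda_k &= \max_{\|\vec{v}\|_2 = 1, \; \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right), \tag{8.18} \\
\vec{u}_k &= \arg\max_{\|\vec{v}\|_2 = 1, \; \vec{v} \perp \vec{u}_1, \ldots, \vec{u}_{k-1}} \operatorname{var}\left(\vec{v}^T \vec{x}_1, \ldots, \vec{v}^T \vec{x}_n\right). \tag{8.19}
\end{align}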


This means that $\vec{u}_1$ is the direction of maximum variation. The eigenvector $\vec{u}_2$ corresponding
to the second largest eigenvalue $\lambda_2$ is the direction of maximum variation that is orthogonal
to $\vec{u}_1$. In general, the eigenvector $\vec{u}_k$ corresponding to the kth largest eigenvalue $\lambda_k$ reveals
the direction of maximum variation that is orthogonal to $\vec{u}_1, \vec{u}_2, \ldots, \vec{u}_{k-1}$. Finally, $\vec{u}_n$ is the
direction of minimum variation.

In data analysis, the eigenvectors of the sample covariance matrix are usually called principal 
directions. Computing these eigenvectors to quantify the variation of a data set in different 
directions is called principal component analysis (PCA). Figure 8.7 shows the principal 
directions for several 2D examples. 
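
As an illustration, the principal directions can be computed from the eigendecomposition of the sample covariance matrix. The sketch below uses Python with NumPy; the function name `principal_directions` and the synthetic 2D data (stretched along the diagonal) are assumptions for illustration, not part of the notes.

```python
import numpy as np

def principal_directions(X):
    """Eigenvalues (descending) and eigenvectors (as columns) of the
    sample covariance matrix of the rows of X."""
    Sigma = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(2)
# Hypothetical 2D data: standard deviation 3 along (1, 1)/sqrt(2), 0.5 across it.
Z = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])
R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
X = Z @ R.T

lam, U = principal_directions(X)
# U[:, 0] is the first principal direction (defined up to sign);
# lam[0] is the sample variance of the data projected onto it.
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because the sample covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.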

Figure 8.8 illustrates the importance of centering before applying PCA. Theorem 8.5.2 still holds
if the data are not centered. However, the norm of the projection onto a certain direction no
longer reflects the variation of the data. In fact, if the data are concentrated around a point
that is far from the origin, the first principal direction tends to be aligned with that point. This
makes sense as projecting onto that direction captures more energy. As a result, the principal
directions do not reflect the directions of maximum variation within the cloud of data. Centering
the data set before applying PCA solves the issue.

The following example explains how to apply principal component analysis to dimensionality
reduction. The motivation is that in many cases directions of higher variation are more informative
about the structure of the data set.

Figure 8.8: PCA applied to n = 100 2D data points. On the left the data are not centered
($\lambda_1/n = 25.78$, $\lambda_2/n = 0.790$). As a result the dominant principal direction $\vec{u}_1$ lies in the direction
of the mean of the data and PCA does not reflect the actual structure. Once we center (right,
$\lambda_1/n = 1.590$, $\lambda_2/n = 0.019$), $\vec{u}_1$ becomes aligned with the direction of maximal variation.

Figure 8.9: Projection of 7-dimensional vectors describing different wheat seeds onto the first two (left)
and the last two (right) principal directions of the data set. Each color represents a variety of wheat.
Horizontal axes: projection onto the first PC (left) and onto the (d-1)th PC (right).

Figure 8.10: Effect of whitening a set of data. The original data $\vec{x}_1, \ldots, \vec{x}_n$ are dominated by a linear
skew (left). Applying $U^T$ aligns the axes with the eigenvectors of the sample covariance matrix (center,
$U^T \vec{x}_1, \ldots, U^T \vec{x}_n$). Finally, $\sqrt{\Lambda}^{-1}$ reweights the data along those axes so that they have the same
average variation, revealing the nonlinear structure that was obscured by the linear skew (right,
$\sqrt{\Lambda}^{-1} U^T \vec{x}_1, \ldots, \sqrt{\Lambda}^{-1} U^T \vec{x}_n$).


Example 8.5.3 (Dimensionality reduction via PCA). We consider a data set where each data
point corresponds to a seed which has seven features: area, perimeter, compactness, length of
kernel, width of kernel, asymmetry coefficient and length of kernel groove. The seeds belong to
three different varieties of wheat: Kama, Rosa and Canadian.³ Our aim is to visualize the data
by projecting the data down to two dimensions in a way that preserves as much variation as
possible. This can be achieved by projecting each point onto the first two principal directions
of the data set.

Figure 8.9 shows the projection of the data onto the first two and the last two principal directions.
In the latter case, there is almost no discernible variation. The structure of the data is much
better preserved by the first two directions, which allow us to clearly visualize the difference between
the three types of seeds. Note however that projection onto the first principal directions only
ensures that we preserve as much variation as possible, but it does not necessarily preserve useful
features for tasks such as classification. △

8.5.3 Whitening 

Whitening is a useful procedure for preprocessing data that contains nonlinear patterns. The 
goal is to eliminate the linear skew in the data by rotating and contracting the data along 
different directions in order to reveal its underlying nonlinear structure. This can be achieved 
by applying a linear transformation that essentially inverts the sample covariance matrix, so that 
the result is uncorrelated. The process is known as whitening, because random vectors with 
uncorrelated entries are often referred to as white noise. It is closely related to Algorithm 8.5.4 
for coloring random vectors. 

Algorithm 8.5.4 (Whitening). Let $\vec{x}_1, \ldots, \vec{x}_n$ be a set of d-dimensional data, which we assume
to be centered and to have a full-rank covariance matrix. To whiten the data set we:

1. Compute the eigendecomposition of the sample covariance matrix $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \Lambda U^T$.

2. Set $\vec{y}_i := \sqrt{\Lambda}^{-1} U^T \vec{x}_i$, for $i = 1, \ldots, n$, where

\[
\sqrt{\Lambda} :=
\begin{bmatrix}
\sqrt{\lambda_1} & 0 & \cdots & 0 \\
0 & \sqrt{\lambda_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\lambda_n}
\end{bmatrix}, \tag{8.20}
\]

so that $\Sigma(\vec{x}_1, \ldots, \vec{x}_n) = U \sqrt{\Lambda} \sqrt{\Lambda} U^T$.

³ The data can be found at https://archive.ics.uci.edu/ml/datasets/seeds.


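The whitened data set $\vec{y}_1, \ldots, \vec{y}_n$ has a sample covariance matrix equal to the identity,

\begin{align}
\Sigma(\vec{y}_1, \ldots, \vec{y}_n) &= \frac{1}{n-1} \sum_{i=1}^{n} \vec{y}_i \vec{y}_i^T \tag{8.21} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \sqrt{\Lambda}^{-1} U^T \vec{x}_i \left(\sqrt{\Lambda}^{-1} U^T \vec{x}_i\right)^T \tag{8.22} \\
&= \sqrt{\Lambda}^{-1} U^T \left( \frac{1}{n-1} \sum_{i=1}^{n} \vec{x}_i \vec{x}_i^T \right) U \sqrt{\Lambda}^{-1} \tag{8.23} \\
&= \sqrt{\Lambda}^{-1} U^T \, \Sigma(\vec{x}_1, \ldots, \vec{x}_n) \, U \sqrt{\Lambda}^{-1} \tag{8.24} \\
&= \sqrt{\Lambda}^{-1} U^T U \sqrt{\Lambda} \sqrt{\Lambda} U^T U \sqrt{\Lambda}^{-1} \tag{8.25} \\
&= I. \tag{8.26}
\end{align}
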


Intuitively, whitening first rotates the data and then shrinks or expands it so that the average 
variation is the same in every direction. As a result, nonlinear patterns become more apparent, 
as illustrated by Figure 8.10. 
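
The whitening procedure can be sketched in a few lines of Python with NumPy. The function name `whiten` and the synthetic skewed data are assumptions for illustration; the sketch centers the data explicitly, which the algorithm above takes as given.

```python
import numpy as np

def whiten(X):
    """Whiten the rows of X: rotate with U^T, then rescale each axis by 1/sqrt(lambda)."""
    Xc = X - X.mean(axis=0)                    # center the data
    Sigma = np.cov(Xc, rowvar=False)
    lam, U = np.linalg.eigh(Sigma)             # Sigma = U diag(lam) U^T
    # Row i of (Xc @ U) is (U^T x_i)^T; dividing by sqrt(lam) rescales each coordinate.
    return (Xc @ U) / np.sqrt(lam)

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.0], [1.5, 0.3]])         # hypothetical linear skew
X = rng.uniform(-1.0, 1.0, size=(400, 2)) @ A.T
Y = whiten(X)
Sigma_Y = np.cov(Y, rowvar=False)              # close to the 2 x 2 identity
```

The division by `np.sqrt(lam)` requires all eigenvalues to be strictly positive, which is the full-rank assumption of the algorithm; for rank-deficient data the corresponding directions would have to be dropped first.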







Chapter 9 


Frequentist Statistics 


The goal of statistical analysis is to extract information from data by computing statistics, 
which are deterministic functions of the data. In Chapter 8 we describe several statistics from 
a deterministic and geometric point of view, without making any assumptions about the data- 
generation process. This makes it very challenging to evaluate the accuracy of the acquired 
information. 

In this chapter we model the data-acquisition process probabilistically. This allows us to analyze
statistical techniques and derive theoretical guarantees on their performance. The data
are interpreted as realizations of random variables, vectors or processes (depending on the
dimensionality). The information that we want to extract can then be expressed in terms of the
joint distribution of these quantities. We consider this distribution to be unknown but fixed,
taking a frequentist perspective. The alternative framework of Bayesian statistics is described
in Chapter 10.

9.1 Independent identically-distributed sampling 

In this chapter we consider one-dimensional real-valued data, modeled as the realization of an 
iid sequence. Figure 9.1 depicts the corresponding graphical model. This is a very popular 
assumption, which holds for controlled experiments, such as randomized trials to test drugs, 
and can often be a good approximation in other settings. However, in practice it is crucial to 
evaluate to what extent the independence assumptions of a model actually hold. 

The following example shows that measuring a quantity by sampling a subset of individuals 
randomly from a large population produces data satisfying the iid assumption, as long as we 
sample with replacement (if the population is large, sampling without replacement will have a 
negligible effect). 

Example 9.1.1 (Sampling from a population). Assume that we are studying a population of
m individuals. We are interested in a certain quantity associated with each person, e.g. their
cholesterol level, their salary or who they are voting for in an election. There are k possible
values for the quantity $\{z_1, z_2, \ldots, z_k\}$, where k can be equal to m or much smaller. We denote
by $m_j$ the number of people for whom the quantity is equal to $z_j$, $1 \le j \le k$. In the case of
an election with two candidates, k would equal two and $m_1$ and $m_2$ would represent the number
of people voting for each of the candidates.






Figure 9.1: Directed graphical model corresponding to an independent sequence. If the sequence is also
identically distributed, then $X_1, X_2, \ldots, X_n$ all have the same distribution.


Let us assume that we select n individuals independently at random with replacement, which
means that one individual could be chosen more than once, and record the value of the quantity
of interest. Under these assumptions the measurements can be modeled as a random sequence
of independent variables $\widetilde{X}$. Since the probability of choosing any individual is the same every
time we make a selection, the first-order pmf of the sequence is

\begin{align}
p_{\widetilde{X}(i)}(z_j) &= P\left(\text{The } i\text{th measurement equals } z_j\right) \tag{9.1} \\
&= \frac{\text{People such that the quantity equals } z_j}{\text{Total number of people}} \tag{9.2} \\
&= \frac{m_j}{m}, \quad 1 \le j \le k, \tag{9.3}
\end{align}

for $1 \le i \le n$ by the law of total probability. We conclude that the data can be modeled as a
realization of an iid sequence. △
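
The following sketch simulates the sampling scheme of the example in Python with NumPy. The population sizes and values are hypothetical; the point is only that the empirical frequencies of draws made with replacement approach the pmf $m_j / m$.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical population of m = 1000 individuals and k = 3 possible values.
values = np.array([0, 1, 2])
counts = np.array([500, 300, 200])                # m_1, m_2, m_3
population = np.repeat(values, counts)
m = len(population)
pmf = counts / m                                  # m_j / m

n = 20000
sample = rng.choice(population, size=n, replace=True)   # independent draws with replacement
freq = np.array([(sample == z).mean() for z in values])
```

With `replace=True` every draw is made from the full population, so the draws are independent and identically distributed. Setting `replace=False` would make them dependent, although the effect is small when n is much smaller than m.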


9.2 Mean square error 

We define an estimator as a deterministic function of the available data $x_1, x_2, \ldots, x_n$ which
provides an approximation to a quantity associated with the distribution that generates the data

\[
y := h(x_1, x_2, \ldots, x_n). \tag{9.4}
\]

For example, as we will see, if we want to estimate the expectation of the underlying distribution,
a reasonable estimator is the average of the data. Since we are taking a frequentist viewpoint,
the quantity of interest is modeled as deterministic (in contrast to the Bayesian viewpoint which
would model it as a random variable). For a fixed data set, the estimator is a deterministic
function of the data. However, if we model the data as realizations of a sequence of random
variables, then the estimator is also a realization of the random variable

\[
Y := h(X_1, X_2, \ldots, X_n). \tag{9.5}
\]

This allows us to evaluate the estimator probabilistically (usually under some assumptions on the
underlying distribution). For instance, we can measure the error incurred by the estimator by
computing the mean square of the difference between the estimator and the true quantity of
interest.

Definition 9.2.1 (Mean square error). The mean square error (MSE) of an estimator Y that
approximates a deterministic quantity $\gamma \in \mathbb{R}$ is

\[
\operatorname{MSE}(Y) := \operatorname{E}\left((Y - \gamma)^2\right). \tag{9.6}
\]





The MSE can be decomposed into a bias term and a variance term. The bias term is the square of
the difference between the expected value of the estimator and the quantity of interest. The
variance term corresponds to the variation of the estimator around its expected value.

Lemma 9.2.2 (Bias-variance decomposition). The MSE of an estimator Y that approximates
$\gamma \in \mathbb{R}$ satisfies

\[
\operatorname{MSE}(Y) = \underbrace{\operatorname{E}\left((Y - \operatorname{E}(Y))^2\right)}_{\text{variance}} + \underbrace{\left(\operatorname{E}(Y) - \gamma\right)^2}_{\text{bias}}. \tag{9.7}
\]

Proof. The lemma is a direct consequence of linearity of expectation. □
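
For completeness, the step behind the proof can be written out as follows. This is a worked sketch of the standard argument, not taken from the notes: add and subtract $\operatorname{E}(Y)$ inside the square and use the fact that $\operatorname{E}(Y) - \gamma$ is deterministic.

```latex
\begin{align*}
\operatorname{MSE}(Y)
&= \operatorname{E}\left( \left( Y - \operatorname{E}(Y) + \operatorname{E}(Y) - \gamma \right)^2 \right) \\
&= \operatorname{E}\left( (Y - \operatorname{E}(Y))^2 \right)
   + 2 \left( \operatorname{E}(Y) - \gamma \right) \operatorname{E}\left( Y - \operatorname{E}(Y) \right)
   + \left( \operatorname{E}(Y) - \gamma \right)^2 \\
&= \operatorname{E}\left( (Y - \operatorname{E}(Y))^2 \right) + \left( \operatorname{E}(Y) - \gamma \right)^2 ,
\end{align*}
```

where the cross term vanishes because $\operatorname{E}(Y - \operatorname{E}(Y)) = \operatorname{E}(Y) - \operatorname{E}(Y) = 0$ by linearity of expectation.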

If the bias is zero, then the estimator equals the quantity of interest on average.

Definition 9.2.3 (Unbiased estimator). An estimator Y that approximates $\gamma \in \mathbb{R}$ is unbiased
if its bias is equal to zero, i.e. if and only if

\[
\operatorname{E}(Y) = \gamma. \tag{9.8}
\]

An estimator may be unbiased but still incur a large mean square error due to its variance.

The following lemmas establish that the sample mean and variance are unbiased estimators of 
the true mean and variance of an iid sequence of random variables. 

Lemma 9.2.4 (The sample mean is unbiased). The sample mean is an unbiased estimator of 
the mean of an iid sequence of random variables. 

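Proof. We consider the sample mean of an iid sequence $\widetilde{X}$ with mean $\mu$,

\[
\widetilde{Y}(n) := \frac{1}{n} \sum_{i=1}^{n} \widetilde{X}(i). \tag{9.9}
\]

By linearity of expectation

\begin{align}
\operatorname{E}\left(\widetilde{Y}(n)\right) &= \frac{1}{n} \sum_{i=1}^{n} \operatorname{E}\left(\widetilde{X}(i)\right) \tag{9.10} \\
&= \mu. \tag{9.11}
\end{align}

□

Lemma 9.2.5 (The sample variance is unbiased). The sample variance is an unbiased estimator
of the variance of an iid sequence of random variables.

The proof of this result is in Section 9.7.1.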
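
Both lemmas can be illustrated with a small Monte Carlo experiment. The sketch below, in Python with NumPy, assumes Gaussian data with hypothetical parameters; it averages the estimators over many independent data sets and contrasts the 1/(n-1) normalization of the sample variance with the 1/n normalization, which is biased.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2 = 1.0, 4.0          # hypothetical true mean and variance
n, trials = 5, 200_000         # a small n makes the bias of the 1/n estimator visible

X = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))   # each row is one data set

avg_sample_mean = X.mean(axis=1).mean()
avg_var_unbiased = X.var(axis=1, ddof=1).mean()   # 1/(n-1) normalization
avg_var_biased = X.var(axis=1, ddof=0).mean()     # 1/n normalization
```

The averages of the sample mean and of the 1/(n-1) sample variance approach `mu` and `sigma2`, whereas the 1/n version approaches `(n - 1) / n * sigma2`.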

9.3 Consistency

If we are estimating a scalar quantity, the estimate should improve as we gather more data.
Ideally the estimate should converge to the true value in the limit when the number of data
$n \to \infty$. Estimators that achieve this are said to be consistent.

Definition 9.3.1 (Consistency). An estimator $\widetilde{Y}(n) := h\left(\widetilde{X}(1), \widetilde{X}(2), \ldots, \widetilde{X}(n)\right)$ that
approximates $\gamma \in \mathbb{R}$ is consistent if it converges to $\gamma$ as $n \to \infty$ in mean square, with probability
one or in probability.


The following theorem shows that the sample mean is consistent.

Theorem 9.3.2 (The sample mean is consistent). The sample mean is a consistent estimator
of the mean of an iid sequence of random variables as long as the variance of the sequence is
bounded.


Proof. We consider the sample mean of an iid sequence $\widetilde{X}$ with mean $\mu$,

\[
\widetilde{Y}(n) := \frac{1}{n} \sum_{i=1}^{n} \widetilde{X}(i). \tag{9.12}
\]

The estimator is equal to the moving average of the data. As a result it converges to $\mu$ in mean
square (and with probability one) by the law of large numbers (Theorem 6.2.2), as long as the
variance $\sigma^2$ of each of the entries in the iid sequence is bounded. □

Example 9.3.3 (Estimating the average height). In this example we illustrate the consistency
of the sample mean. Imagine that we want to estimate the mean height in a population. To be
concrete we consider a population of m := 25000 people. Figure 9.2 shows a histogram of their
heights.¹ As explained in Example 9.1.1 if we sample n individuals from this population with
replacement, then their heights form an iid sequence $\widetilde{X}$. The mean of this sequence is

\begin{align}
\operatorname{E}\left(\widetilde{X}(i)\right) &:= \sum_{j=1}^{m} P\left(\text{Person } j \text{ is chosen}\right) \cdot \text{height of person } j \tag{9.13} \\
&= \frac{1}{m} \sum_{j=1}^{m} h_j \tag{9.14} \\
&= \operatorname{av}(h_1, \ldots, h_m) \tag{9.15}
\end{align}

for $1 \le i \le n$, where $h_1, \ldots, h_m$ are the heights of the people. In addition, the variance
is bounded because the heights are finite. By Theorem 9.3.2 the sample mean of the n data
should converge to the mean of the iid sequence and hence to the average height over the whole
population. Figure 9.3 illustrates this numerically.

△


¹ The data are available here: wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights.

Figure 9.2: Histogram of the heights of a group of 25 000 people.

Figure 9.3: Different realizations of the sample mean when individuals from the population in Figure 9.2
are sampled with replacement.

Figure 9.4: Realization of the moving average of an iid Cauchy sequence (top) compared to the moving
median (bottom). The bottom panel shows the moving median together with the median of the iid sequence.

If the mean of the underlying distribution is not well defined, or its variance is unbounded,
then the sample mean is not necessarily a consistent estimator. This is related to the fact that
the sample mean can be severely affected by the presence of extreme values, as we discussed
in Section 8.2. The sample median, in contrast, tends to be more robust in such situations, as
discussed in Section 8.3. The following theorem establishes that the sample median is consistent
under the iid assumption, even if the mean is not well defined or the variance is unbounded.
The proof is in Section 9.7.2.

Theorem 9.3.4 (Sample median as an estimator of the median). The sample median is a 
consistent estimator of the median of an iid sequence of random variables. 

Figure 9.4 compares the moving average and the moving median of an iid sequence of Cauchy 
random variables for three different realizations. The moving average is unstable and does not 
converge no matter how many data are available, which is not surprising because the mean is 
not well defined. In contrast, the moving median does eventually converge to the true median 
as predicted by Theorem 9.3.4. 
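
The contrast between the two estimators is easy to reproduce. The following sketch in Python with NumPy draws iid standard Cauchy samples (whose median is zero) and computes the moving average and the sample median on growing prefixes of the data; the sample size and checkpoints are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_cauchy(size=n)      # iid Cauchy samples; the true median is 0

# Moving average: mean of the first k samples for every k.
running_mean = np.cumsum(x) / np.arange(1, n + 1)

# Sample median computed on growing prefixes of the data.
checkpoints = [100, 1_000, 10_000, 100_000]
running_median = [np.median(x[:k]) for k in checkpoints]
```

Plotting `running_mean` typically shows abrupt jumps whenever an extreme sample arrives, while the entries of `running_median` settle near zero as the prefix length grows.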

The sample variance and covariance are consistent estimators of the variance and covariance
respectively, under certain assumptions on the higher moments of the underlying distributions.
This provides an intuitive interpretation for principal component analysis (see Section 8.5.2)
under the assumption that the data are realizations of an iid sequence of random vectors: the
principal components approximate the eigenvectors of the true covariance matrix (see Section 4.3.3),
and hence the directions of maximum variance of the multidimensional distribution. Figure 9.5
illustrates this with a numerical example, where the principal components indeed converge to
the eigenvectors as the number of data increases.

Figure 9.5: Principal directions of n samples from a bivariate Gaussian distribution (red) compared to
the eigenvectors of the covariance matrix of the distribution (black), for n = 5, n = 20 and n = 100.

9.4 Confidence intervals 

Consistency implies that an estimator will be perfect if we acquire infinite data, but this is of
course impossible in practice. It is therefore important to quantify the accuracy of an estimator
for a fixed number of data. Confidence intervals allow us to do this from a frequentist point of
view. A confidence interval can be interpreted as a soft estimate of the deterministic quantity
of interest, which guarantees that the true value will belong to the interval with a certain
probability.

Definition 9.4.1 (Confidence interval). A $1 - \alpha$ confidence interval $\mathcal{I}$ for $\gamma \in \mathbb{R}$ satisfies

\[
P\left(\gamma \in \mathcal{I}\right) \ge 1 - \alpha, \tag{9.16}
\]

where $0 < \alpha < 1$.

Confidence intervals are usually of the form [Y — c,Y + c] where Y is an estimator of the quantity 
of interest and c is a constant that depends on the number of data. The following theorem derives 
a confidence interval for the mean of an iid sequence. The confidence interval is centered at the 
sample mean. 

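Theorem 9.4.2 (Confidence interval for the mean of an iid sequence). Let $\widetilde{X}$ be an iid sequence
with mean $\mu$ and variance $\sigma^2 \le b^2$ for some $b > 0$. For any $0 < \alpha < 1$

\[
\mathcal{I}_n := \left[ Y_n - \frac{b}{\sqrt{\alpha n}}, \; Y_n + \frac{b}{\sqrt{\alpha n}} \right], \qquad Y_n := \operatorname{av}\left(\widetilde{X}(1), \widetilde{X}(2), \ldots, \widetilde{X}(n)\right), \tag{9.17}
\]

is a $1 - \alpha$ confidence interval for $\mu$.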
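
Proof. Recall that the variance of $Y_n$ equals $\operatorname{Var}(Y_n) = \sigma^2 / n$ (see equation (6.21) in the proof
of Theorem 6.2.2). We have

\begin{align}
P\left( \mu \in \left[ Y_n - \frac{b}{\sqrt{\alpha n}}, \; Y_n + \frac{b}{\sqrt{\alpha n}} \right] \right)
&= 1 - P\left( |Y_n - \mu| > \frac{b}{\sqrt{\alpha n}} \right) \tag{9.18} \\
&\ge 1 - \frac{\alpha n \operatorname{Var}(Y_n)}{b^2} \quad \text{by Chebyshev's inequality} \tag{9.19} \\
&= 1 - \frac{\alpha \sigma^2}{b^2} \tag{9.20} \\
&\ge 1 - \alpha. \tag{9.21}
\end{align}

□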


The width of the interval provided in the theorem decreases with n for fixed $\alpha$, which makes
sense as incorporating more data reduces the variance of the estimator and hence our uncertainty
about it.


Example 9.4.3 (Bears in Yosemite). A scientist is trying to estimate the average weight of the
black bears in Yosemite National Park. She manages to capture 300 bears. We assume that the
bears are sampled uniformly at random with replacement (a bear can be weighed more than
once). Under these assumptions, in Example 9.1.1 we show that the data can be modeled as iid
samples and in Example 9.3.3 we show the sample mean is a consistent estimator of the mean
of the whole population.

The average weight of the 300 captured bears is Y := 200 lbs. To derive a confidence interval
from this information we need a bound on the variance. The maximum weight recorded for a
black bear ever is 880 lbs. Let $\mu$ and $\sigma^2$ be the (unknown) mean and variance of the weights of
the whole population. If X is the weight of a bear chosen uniformly at random from the whole
population then X has mean $\mu$ and variance $\sigma^2$, so

\begin{align}
\sigma^2 &= \operatorname{E}\left(X^2\right) - \operatorname{E}^2(X) \tag{9.22} \\
&\le \operatorname{E}\left(X^2\right) \tag{9.23} \\
&\le 880^2 \quad \text{because } 0 \le X \le 880. \tag{9.24}
\end{align}

As a result, 880 is an upper bound for the standard deviation. Applying Theorem 9.4.2 with
b = 880, n = 300 and $\alpha = 0.05$,

\[
\left[ Y - \frac{b}{\sqrt{\alpha n}}, \; Y + \frac{b}{\sqrt{\alpha n}} \right] = [-27.2, 427.2] \tag{9.25}
\]

is a 95% confidence interval for the average weight of the whole population. The interval is not
very precise because n is not very large. △
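
The arithmetic of the example can be reproduced with a few lines of Python. The function name `chebyshev_interval` is an assumption for illustration; it simply evaluates the half-width $b / \sqrt{\alpha n}$ from Theorem 9.4.2.

```python
import math

def chebyshev_interval(sample_mean, b, n, alpha):
    """1 - alpha confidence interval based on Chebyshev's inequality,
    where b is an upper bound on the standard deviation."""
    half_width = b / math.sqrt(alpha * n)
    return sample_mean - half_width, sample_mean + half_width

# Values from the bear example: mean 200 lbs, bound 880 lbs, n = 300, 95% confidence.
low, high = chebyshev_interval(sample_mean=200.0, b=880.0, n=300, alpha=0.05)
```

Because the half-width scales as $1/\sqrt{n}$, halving the width of the interval requires four times as many samples.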


As illustrated by this example, confidence intervals derived from Chebyshev’s inequality tend to 
be very conservative. An alternative is to leverage the central limit theorem (CLT). The CLT 
characterizes the distribution of the sample mean asymptotically, so confidence intervals derived 
from it are not guaranteed to be precise. However, the CLT often provides a very accurate 
approximation to the distribution of the sample mean for finite n, as we show through some 
numerical examples in Chapter 6. In order to obtain confidence intervals for the mean of an iid 
sequence from the CLT as stated in Theorem 6.3.1 we would need to know the true variance of 
the sequence, which is unrealistic in practice. However, the following result states that we can
substitute the true variance with the sample variance. The proof is beyond the scope of these 
notes. 


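Theorem 9.4.4 (Central limit theorem with sample standard deviation). Let $\widetilde{X}$ be an iid
discrete random process with mean $\mu$ such that its variance and fourth moment $\operatorname{E}\left(\widetilde{X}(i)^4\right)$
are bounded. The sequence

\[
\frac{\sqrt{n} \left( \operatorname{av}\left(\widetilde{X}(1), \ldots, \widetilde{X}(n)\right) - \mu \right)}{\operatorname{std}\left(\widetilde{X}(1), \ldots, \widetilde{X}(n)\right)} \tag{9.26}
\]

converges in distribution to a standard Gaussian random variable.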


Recall that the cdf of a standard Gaussian does not have a closed-form expression. To simplify
notation we express the confidence interval in terms of the Q function.

Definition 9.4.5 (Q function). Q (x) is the probability that a standard Gaussian random variable
is greater than x for positive x,

\[
Q(x) := \int_{u=x}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right) \mathrm{d}u, \qquad x \ge 0. \tag{9.27}
\]

By symmetry, if U is a standard Gaussian random variable and y < 0

\[
P(U < y) = Q(-y). \tag{9.28}
\]


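Corollary 9.4.6 (Approximate confidence interval for the mean). Let $\widetilde{X}$ be an iid sequence that
satisfies the conditions of Theorem 9.4.4. For any $0 < \alpha < 1$

\begin{align}
\mathcal{I}_n &:= \left[ Y_n - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right), \; Y_n + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right], \tag{9.29} \\
Y_n &:= \operatorname{av}\left(\widetilde{X}(1), \widetilde{X}(2), \ldots, \widetilde{X}(n)\right), \tag{9.30} \\
S_n &:= \operatorname{std}\left(\widetilde{X}(1), \widetilde{X}(2), \ldots, \widetilde{X}(n)\right), \tag{9.31}
\end{align}

is an approximate $1 - \alpha$ confidence interval for $\mu$, i.e.

\[
P\left(\mu \in \mathcal{I}_n\right) \approx 1 - \alpha. \tag{9.32}
\]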

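Proof. By Theorem 9.4.4, when $n \to \infty$ the normalized quantity $\sqrt{n}\,(Y_n - \mu) / S_n$ is distributed
as a standard Gaussian random variable. As a result

\begin{align}
P\left(\mu \in \mathcal{I}_n\right)
&= 1 - P\left( Y_n > \mu + \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right) - P\left( Y_n < \mu - \frac{S_n}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right) \tag{9.33} \\
&= 1 - P\left( \frac{\sqrt{n}\,(Y_n - \mu)}{S_n} > Q^{-1}\left(\frac{\alpha}{2}\right) \right) - P\left( \frac{\sqrt{n}\,(Y_n - \mu)}{S_n} < -Q^{-1}\left(\frac{\alpha}{2}\right) \right) \tag{9.34} \\
&\approx 1 - 2\, Q\left( Q^{-1}\left(\frac{\alpha}{2}\right) \right) \quad \text{by Theorem 9.4.4} \tag{9.35} \\
&= 1 - \alpha. \tag{9.36}
\end{align}

□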

It is important to stress that the result only provides an accurate confidence interval if n is large
enough for the sample variance to converge to the true variance and for the CLT to take effect. 


Example 9.4.7 (Bears in Yosemite (continued)). The sample standard deviation of the bears
captured by the scientist equals S := 100 lbs. We apply Corollary 9.4.6 to derive an approximate
confidence interval that is tighter than the one obtained applying Chebyshev's inequality. Given
that $Q(1.95) \approx 0.025$,

\[
\left[ Y - \frac{S}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right), \; Y + \frac{S}{\sqrt{n}} Q^{-1}\left(\frac{\alpha}{2}\right) \right] = [188.8, 211.3] \tag{9.37}
\]

is an approximate 95% confidence interval for the mean weight of the population of bears.

△
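
The interval of the example can be computed with the Python standard library. In this sketch the helper names `q_inverse` and `clt_interval` are assumptions for illustration; `statistics.NormalDist().inv_cdf` supplies the inverse of the standard Gaussian cdf, from which $Q^{-1}(p)$ is obtained as the inverse cdf evaluated at $1 - p$.

```python
import math
from statistics import NormalDist

def q_inverse(p):
    """Inverse of the Q function: the x such that P(U > x) = p for a standard Gaussian U."""
    return NormalDist().inv_cdf(1.0 - p)

def clt_interval(sample_mean, sample_std, n, alpha):
    """Approximate 1 - alpha confidence interval for the mean based on the CLT."""
    half_width = sample_std / math.sqrt(n) * q_inverse(alpha / 2.0)
    return sample_mean - half_width, sample_mean + half_width

# Values from the bear example: mean 200 lbs, sample std 100 lbs, n = 300, 95% confidence.
low, high = clt_interval(sample_mean=200.0, sample_std=100.0, n=300, alpha=0.05)
```

The numerically exact value of $Q^{-1}(0.025)$ is close to 1.96, so the endpoints returned here can differ in the last digit from those obtained with the rounded value 1.95 used in the example.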


Interpreting confidence intervals is somewhat tricky. After computing the confidence interval in 
Example 9.4.7 one is tempted to state: 

The probability that the average weight is between 188.8 and 211.3 lbs is 0.95. 

However we are modeling the average weight as a deterministic quantity, so there are no random 
quantities in this statement! The correct interpretation is that if we repeat the process of 
sampling the population and compute the confidence interval many times, then the true value 
will lie in the interval 95% of the time. This is illustrated in the following example and Figure 9.6. 

Example 9.4.8 (Estimating the average height (continued)). Figure 9.6 shows several 95% 
confidence intervals for the average of the height population in Example 9.3.3. To compute 
each interval we select n individuals and then apply Corollary 9.4.6. The width of the intervals 
decreases as n grows, but because they are all 95% confidence intervals they all contain the true 
average with probability 0.95. Indeed this is the case for 113 out of 120 (94%) of the intervals 
that are plotted. 

△
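
The frequentist interpretation can be checked by simulation: repeat the experiment many times and count how often the interval contains the true mean. The sketch below uses Python with NumPy; the Gaussian population and its parameters are hypothetical stand-ins for the height data.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)
true_mean, true_std = 68.0, 2.0        # hypothetical population parameters
n, repetitions, alpha = 200, 5000, 0.05
z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # Q^{-1}(alpha / 2)

# Each row is an independent data set of n iid samples.
samples = rng.normal(true_mean, true_std, size=(repetitions, n))
means = samples.mean(axis=1)
half_widths = samples.std(axis=1, ddof=1) / np.sqrt(n) * z

# An interval covers the true mean when the sample mean is within its half-width.
covered = np.abs(means - true_mean) <= half_widths
coverage = covered.mean()
```

The fraction `coverage` should be close to 0.95. It is the intervals that vary from one repetition to the next, not the true mean.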


9.5 Nonparametric model estimation 

In this section we consider the problem of estimating a distribution from multiple iid samples.
This requires approximating the cdf, pmf or pdf of the distribution. If we assume that the
distribution belongs to a predefined family, then the problem reduces to estimating the parameters
that characterize that particular family, as we explain in detail in Section 9.6. Here we do not
make such an assumption. Estimating a distribution directly is very challenging; clearly many
(infinitely many!) different distributions could have generated the data. However with enough
samples it is often possible to obtain models that produce an accurate approximation, as long as
the iid assumption holds.

9.5.1 Empirical cdf 

Under the assumption that a data set corresponds to iid samples from a certain distribution, a
reasonable estimate for the cdf of the distribution at a given point x is the fraction of samples
that are smaller than or equal to x. This results in a piecewise constant estimator known as the
empirical cdf.

Figure 9.6: 95% confidence intervals for the average of the height population in Example 9.3.3, for
n = 50, n = 200 and n = 1000. The true mean is marked in each panel.


Definition 9.5.1 (Empirical cdf). The empirical cdf corresponding to data $x_1, \ldots, x_n$ is

\[
\widehat{F}_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1_{x_i \le x}, \tag{9.38}
\]

where $x \in \mathbb{R}$.

The empirical cdf is an unbiased and consistent estimator of the true cdf. This is established 
rigorously in Theorem 9.5.2 below and illustrated empirically in Figure 9.7. The cdf of the height 
data from 25,000 people is compared to three realizations of the empirical cdf computed from 
different numbers of iid samples. As the number of available samples grows, the approximation 
becomes very accurate. 

Theorem 9.5.2. Let $\widetilde{X}$ be an iid sequence with marginal cdf $F_{\widetilde{X}}$. For any fixed $x \in \mathbb{R}$, $\widehat{F}_n(x)$
is an unbiased and consistent estimator of $F_{\widetilde{X}}(x)$. In fact, $\widehat{F}_n(x)$ converges in mean square to
$F_{\widetilde{X}}(x)$.


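Proof. First, we verify

\begin{align}
\operatorname{E}\left(\widehat{F}_n(x)\right) &= \operatorname{E}\left( \frac{1}{n} \sum_{i=1}^{n} 1_{\widetilde{X}(i) \le x} \right) \tag{9.39} \\
&= \frac{1}{n} \sum_{i=1}^{n} P\left(\widetilde{X}(i) \le x\right) \quad \text{by linearity of expectation} \tag{9.40} \\
&= F_{\widetilde{X}}(x), \tag{9.41}
\end{align}

so the estimator is unbiased. We now estimate its mean square

\begin{align}
\operatorname{E}\left(\widehat{F}_n^2(x)\right) &= \operatorname{E}\left( \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} 1_{\widetilde{X}(i) \le x} 1_{\widetilde{X}(j) \le x} \right) \tag{9.42} \\
&= \frac{1}{n^2} \sum_{i=1}^{n} P\left(\widetilde{X}(i) \le x\right) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} P\left(\widetilde{X}(i) \le x, \widetilde{X}(j) \le x\right) \tag{9.43} \\
&= \frac{F_{\widetilde{X}}(x)}{n} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} F_{\widetilde{X}(i)}(x) F_{\widetilde{X}(j)}(x) \quad \text{by independence} \tag{9.44} \\
&= \frac{F_{\widetilde{X}}(x)}{n} + \frac{n-1}{n} F_{\widetilde{X}}^2(x). \tag{9.45}
\end{align}

The variance is consequently equal to

\begin{align}
\operatorname{Var}\left(\widehat{F}_n(x)\right) &= \operatorname{E}\left(\widehat{F}_n(x)^2\right) - \operatorname{E}^2\left(\widehat{F}_n(x)\right) \tag{9.46} \\
&= \frac{F_{\widetilde{X}}(x) \left(1 - F_{\widetilde{X}}(x)\right)}{n}. \tag{9.47}
\end{align}

We conclude that

\[
\lim_{n \to \infty} \operatorname{E}\left( \left( F_{\widetilde{X}}(x) - \widehat{F}_n(x) \right)^2 \right) = \lim_{n \to \infty} \operatorname{Var}\left(\widehat{F}_n(x)\right) = 0. \tag{9.48}
\]

□

Figure 9.7: Cdf of the height data in Figure 2.13 along with three realizations of the empirical cdf
computed with n iid samples for n = 10, 100, 1000.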


9.5.2 Density estimation 

Estimating the pdf of a continuous quantity is much more challenging than estimating the cdf.
If we have sufficient data, the fraction of samples that are smaller than a certain x provides a
good estimate for the cdf at that point. However, no matter how much data we have, there is
negligible probability that we will see any samples exactly at x: a pointwise empirical density
estimator would equal zero almost everywhere (except at the available samples).

Our only hope to produce an accurate estimator is if the pdf that we aim to estimate is smooth. 
In that case, we can estimate its value at a point x from observed samples that are situated at 
neighboring locations. If there are many samples close to x then this suggests that the estimate 
at x should be large, whereas if all the samples are far away, then it should be small. Kernel 
density estimation achieves this by averaging the samples. 

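Definition 9.5.3 (Kernel density estimator). The kernel density estimate with bandwidth $h$ of
the distribution of $x_1, \ldots, x_n$ at $x \in \mathbb{R}$ is

$$\widehat{f}_{h,n}(x) := \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right), \tag{9.49}$$

where $k$ is a kernel function centered at the origin that satisfies

\begin{align}
k(x) &\ge 0 \quad \text{for all } x \in \mathbb{R}, \tag{9.50} \\
\int_{\mathbb{R}} k(x) \, \mathrm{d}x &= 1. \tag{9.51}
\end{align}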
Figure 9.8: Kernel density estimation for the Gaussian mixture described in Example 9.6.5 for different
numbers of iid samples and different values of the kernel bandwidth h.
The effect of the kernel is to weight each sample according to its distance to the point $x$ at which
we are estimating the pdf. Choosing a rectangular kernel yields an empirical density estimate
that is piecewise constant and roughly looks like a histogram (the corresponding weights are
constant or equal to zero). A popular alternative is the Gaussian kernel $k(x) = \exp(-x^2)/\sqrt{\pi}$,
which produces a smooth density estimate. The kernel should decay so that $k((x - x_i)/h)$ is
large when the sample $x_i$ is close to $x$ and small when it is far. This decay is governed by the
bandwidth $h$, which is chosen beforehand based on our expectations about the smoothness of
the pdf and on the amount of available data. If the bandwidth is very small, individual samples
have a large influence on the density estimate. This makes it possible to reproduce irregular shapes more
easily, but also yields spurious fluctuations that are not present in the true curve, especially if we
don't have a lot of samples. Increasing the bandwidth smooths out such fluctuations and yields
more stable estimates when the number of data points is small. However, it may also over-smooth the
estimate. As a rule of thumb, we should decrease the bandwidth of the kernel as the number of
data points increases.
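
To make the role of the bandwidth concrete, here is a minimal sketch of a kernel density estimator with a Gaussian kernel, written in Python with NumPy. It assumes the usual form of the estimator, the average of $k((x - x_i)/h)$ over the samples divided by $h$; the function names, the simulated data and the two bandwidth values are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    # Nonnegative and normalized to integrate to one over the real line.
    return np.exp(-u ** 2) / np.sqrt(np.pi)

def kernel_density_estimate(samples, x, h, kernel=gaussian_kernel):
    samples = np.asarray(samples, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # (1 / (n h)) * sum_i k((x - x_i) / h), evaluated at every query point.
    u = (x[:, None] - samples[None, :]) / h
    return kernel(u).sum(axis=1) / (len(samples) * h)

# Simulated data (hypothetical): 200 samples from a standard Gaussian.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)
grid = np.linspace(-8.0, 8.0, 1601)
narrow = kernel_density_estimate(samples, grid, h=0.1)  # follows individual samples
wide = kernel_density_estimate(samples, grid, h=1.0)    # much smoother
```

Both estimates are valid densities (nonnegative, integrating to one); they differ only in how much each individual sample is spread out.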

Figures 9.8 and 9.9 illustrate the effect of varying the bandwidth h at different sampling rates. 
In Figure 9.8 Gaussian kernel density estimation is applied to estimate the Gaussian mixture 
described in Example 9.6.5. Figure 9.9 shows an example where the same technique is used on 
real data: the aim is to estimate the density of the weight of a sea-snail population. 2 The whole 
population consists of 4,177 individuals. The kernel density estimate is computed from 200 iid 
samples for different values of the kernel bandwidth. 


9.6 Parametric model estimation 

In the previous section, we described how to estimate a distribution by directly estimating the
cdf or pdf generating the data. In this section, we discuss an alternative route based on the 
assumption that the type of distribution generating the data is known beforehand. If this is 
the case, the problem boils down to fitting the parameters characterizing the distribution to the 
data. Recall that from a frequentist viewpoint, the true distribution is fixed, so the corresponding 
parameters are modeled as deterministic quantities (in contrast, in a Bayesian framework they 
are modeled as random variables). 

9.6.1 The method of moments 

The method of moments adjusts the parameters of a distribution so that the moments of the
distribution coincide with the sample moments of the data (i.e. its mean, mean square or
variance, etc.). If the distribution only depends on one parameter, then we use the sample mean
as a surrogate for the true mean and compute the corresponding value of the parameter. For an
exponential with parameter $\lambda$ and mean $\mu$ we have

$$\mu = \frac{1}{\lambda}. \tag{9.52}$$

Assuming that we have access to $n$ iid samples $x_1, \ldots, x_n$ from the exponential distribution, the
method-of-moments estimate of $\lambda$ equals

$$\lambda_{\text{MM}} := \frac{1}{\operatorname{av}(x_1, \ldots, x_n)}. \tag{9.53}$$

2 The data are available at archive.ics.uci.edu/ml/datasets/Abalone
Figure 9.9: Kernel density estimate for the weight (in grams) of a population of abalone, a species of sea snail. In
the plot above the density is estimated from 200 iid samples using a Gaussian kernel with three different
bandwidths. Black crosses representing the individual samples are shown underneath. In the plot below
we see the result of repeating the procedure three times using a fixed bandwidth equal to 0.25.
Figure 9.10: Exponential distribution fitted to data consisting of inter-arrival times of calls at a call
center in Israel (left). Gaussian distribution fitted to height data (right).


The graph on the left of Figure 9.10 shows the result of fitting an exponential to the call-center
data in Figure 2.11. Similarly, to fit a Gaussian using the method of moments we set the mean
equal to the sample mean and the variance equal to the sample variance, as illustrated by the
graph on the right of Figure 9.10 using the data from Figure 2.13.
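
As a small illustration of the exponential case, the sketch below inverts the sample mean to estimate the parameter. It is written in Python with NumPy; the function name, the simulated data and the chosen true parameter are assumptions made for the example.

```python
import numpy as np

def exponential_mm_estimate(samples):
    # The mean of an exponential is 1 / lambda, so we invert the sample mean.
    return 1.0 / np.mean(samples)

# Simulated data (hypothetical): exponential samples with lambda = 2.
rng = np.random.default_rng(1)
true_lambda = 2.0
samples = rng.exponential(scale=1.0 / true_lambda, size=10_000)
lambda_mm = exponential_mm_estimate(samples)
```

With this many samples the estimate should land close to the true parameter, although the method of moments offers no guarantee for any particular realization.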


9.6.2 Maximum likelihood 

The most popular method for learning parametric models is maximum-likelihood fitting. The 
likelihood function is the joint pmf or pdf of the data, interpreted as a function of the unknown 
parameters. In more detail, let us denote the data by $x_1, \ldots, x_n$ and assume that they are
realizations of a set of discrete random variables $X_1, \ldots, X_n$ which have a joint pmf that depends
on a vector of parameters $\vec{\theta}$. To emphasize that the joint pmf depends on $\vec{\theta}$ we denote it by
$p_{\vec{\theta}} := p_{X_1, \ldots, X_n}$. This pmf evaluated at the observed data

$$p_{\vec{\theta}}(x_1, \ldots, x_n) \tag{9.54}$$

is the likelihood function, when we interpret it as a function of $\vec{\theta}$. For continuous random
variables, we use the joint pdf of the data instead.

Definition 9.6.1 (Likelihood function). Given a realization $x_1, \ldots, x_n$ of a set of discrete random
variables $X_1, \ldots, X_n$ with joint pmf $p_{\vec{\theta}}$, where $\vec{\theta} \in \mathbb{R}^m$ is a vector of parameters, the
likelihood function is

$$\mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) := p_{\vec{\theta}}(x_1, \ldots, x_n). \tag{9.55}$$

If the random variables are continuous with pdf $f_{\vec{\theta}}$, where $\vec{\theta} \in \mathbb{R}^m$, the likelihood function is

$$\mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) := f_{\vec{\theta}}(x_1, \ldots, x_n). \tag{9.56}$$

The log-likelihood function is equal to the logarithm of the likelihood function, $\log \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta})$.
When the data are modeled as iid samples, the likelihood factors into a product of the marginal
pmfs or pdfs, so the log-likelihood can be decomposed into a sum.

In the case of discrete distributions, for a fixed $\vec{\theta}$ the likelihood is the probability that $X_1, \ldots, X_n$
equal the observed data. If we don't know $\vec{\theta}$, it makes sense to choose a value for $\vec{\theta}$ such that this
probability is as high as possible, i.e. to maximize the likelihood. For continuous distributions
we apply the same principle to the joint pdf of the data.

Definition 9.6.2 (Maximum-likelihood estimator). The maximum likelihood (ML) estimator
for the vector of parameters $\vec{\theta} \in \mathbb{R}^m$ is

\begin{align}
\vec{\theta}_{\text{ML}}(x_1, \ldots, x_n) &:= \arg\max_{\vec{\theta}} \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}) \tag{9.57} \\
&= \arg\max_{\vec{\theta}} \log \mathcal{L}_{x_1, \ldots, x_n}(\vec{\theta}). \tag{9.58}
\end{align}

The maximum of the likelihood function and that of the log-likelihood function are at the same
location because the logarithm is monotonically increasing.

Under certain conditions, one can show that the maximum-likelihood estimator is consistent: it 
converges in probability to the true parameter as the number of data increases. One can also 
show that its distribution converges to that of a Gaussian random variable (or vector), just like 
the distribution of the sample mean. These results are beyond the scope of the course. Bear in 
mind, however, that they only hold if the data are indeed generated by the type of distribution 
that we are considering. 

We now show how to derive the maximum-likelihood estimator for a Bernoulli and a Gaussian distribution.
The resulting estimators for the parameters are the same as the method-of-moments estimators
(except for a slight difference in the estimate of the Gaussian variance parameter).

Example 9.6.3 (ML estimator of the parameter of a Bernoulli distribution). We model a set
of data $x_1, \ldots, x_n$ as iid samples from a Bernoulli distribution with parameter $\theta$ (in this case
there is only one parameter). The likelihood function is equal to

\begin{align}
\mathcal{L}_{x_1, \ldots, x_n}(\theta) &= p_{\theta}(x_1, \ldots, x_n) \tag{9.59} \\
&= \prod_{i=1}^{n} \left(1_{x_i = 1} \, \theta + 1_{x_i = 0} \, (1 - \theta)\right) \tag{9.60} \\
&= \theta^{n_1} (1 - \theta)^{n_0} \tag{9.61}
\end{align}

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\theta) = n_1 \log \theta + n_0 \log(1 - \theta), \tag{9.62}$$

where $n_1$ is the number of samples equal to one and $n_0$ the number of samples equal to zero.
The ML estimator of the parameter $\theta$ is

\begin{align}
\theta_{\text{ML}} &= \arg\max_{\theta} \log \mathcal{L}_{x_1, \ldots, x_n}(\theta) \tag{9.63} \\
&= \arg\max_{\theta} \; n_1 \log \theta + n_0 \log(1 - \theta). \tag{9.64}
\end{align}
We compute the derivative and second derivative of the log-likelihood function,

\begin{align}
\frac{\mathrm{d} \log \mathcal{L}_{x_1, \ldots, x_n}(\theta)}{\mathrm{d}\theta} &= \frac{n_1}{\theta} - \frac{n_0}{1 - \theta}, \tag{9.65} \\
\frac{\mathrm{d}^2 \log \mathcal{L}_{x_1, \ldots, x_n}(\theta)}{\mathrm{d}\theta^2} &= -\frac{n_1}{\theta^2} - \frac{n_0}{(1 - \theta)^2} < 0. \tag{9.66}
\end{align}

The function is concave, as the second derivative is negative. The maximum is consequently at
the point where the first derivative equals zero, namely

$$\theta_{\text{ML}} = \frac{n_1}{n_0 + n_1}. \tag{9.67}$$

The estimate is equal to the fraction of samples that are equal to one.

△
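
The closed-form estimate can be checked numerically by maximizing the log-likelihood over a grid of candidate values. The sketch below uses Python with NumPy; the data vector and the grid resolution are arbitrary choices made for illustration.

```python
import numpy as np

def bernoulli_log_likelihood(theta, n1, n0):
    # n1 * log(theta) + n0 * log(1 - theta), for theta strictly inside (0, 1).
    return n1 * np.log(theta) + n0 * np.log(1.0 - theta)

# Toy data (hypothetical): eight coin flips.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0])
n1 = int(np.sum(data == 1))
n0 = int(np.sum(data == 0))

# Closed form: fraction of samples equal to one.
theta_closed_form = n1 / (n0 + n1)

# Brute-force maximization over a grid that avoids the endpoints 0 and 1.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(bernoulli_log_likelihood(grid, n1, n0))]
```

Up to the grid resolution, the brute-force maximizer agrees with the fraction of ones in the data.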


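Example 9.6.4 (ML estimator of the parameters of a Gaussian distribution). Let $x_1, x_2, \ldots$
be data that we wish to model as iid samples from a Gaussian distribution with mean $\mu$ and
standard deviation $\sigma$. The likelihood function is equal to

\begin{align}
\mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) &= f_{\mu, \sigma}(x_1, \ldots, x_n) \tag{9.68} \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \tag{9.69}
\end{align}

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = -\frac{n \log(2\pi)}{2} - n \log \sigma - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}. \tag{9.70}$$

The ML estimator of the parameters $\mu$ and $\sigma$ is

\begin{align}
\{\mu_{\text{ML}}, \sigma_{\text{ML}}\} &= \arg\max_{\{\mu, \sigma\}} \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) \tag{9.71} \\
&= \arg\max_{\{\mu, \sigma\}} \; -n \log \sigma - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}. \tag{9.72}
\end{align}

We compute the partial derivatives of the log-likelihood function,

\begin{align}
\frac{\partial \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma)}{\partial \mu} &= \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2}, \tag{9.73} \\
\frac{\partial \log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma)}{\partial \sigma} &= -\frac{n}{\sigma} + \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^3}. \tag{9.74}
\end{align}

The function we are trying to maximize is strictly concave in $\{\mu, \sigma\}$. To prove this, we would
have to show that the Hessian of the function is negative definite. We omit the calculations that
show that this is the case. Setting the partial derivatives to zero we obtain

\begin{align}
\mu_{\text{ML}} &= \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{9.75} \\
\sigma^2_{\text{ML}} &= \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \mu_{\text{ML}}\right)^2. \tag{9.76}
\end{align}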
Figure 9.11: The left column shows histograms of 50 iid samples from a Gaussian distribution, together
with the pdf of the original distribution, as well as the maximum-likelihood estimate. The right column
shows the log-likelihood function corresponding to the data and the location of its maximum and of the
point corresponding to the true parameters.
Figure 9.12: The left image shows a histogram of 40 iid samples from the Gaussian mixture defined in
Example 9.6.5, together with the pdf of the original distribution. The right image shows the log-likelihood
function corresponding to the data, which has a local maximum apart from the global maximum. The
density estimates corresponding to the two maxima are shown on the left.


The estimator for the mean is just the sample mean. The estimator for the variance is a rescaled
sample variance.

△
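
The rescaling mentioned above is easy to see numerically. The following Python sketch (NumPy assumed; the function name and the four data points are made up for the example) computes the ML estimates and compares the variance estimate with the usual sample variance, which divides by n - 1 instead of n.

```python
import numpy as np

def gaussian_ml_estimate(samples):
    samples = np.asarray(samples, dtype=float)
    mu_ml = samples.mean()
    # The ML variance estimate divides by n rather than n - 1.
    sigma2_ml = np.mean((samples - mu_ml) ** 2)
    return mu_ml, sigma2_ml

# Toy data (hypothetical).
x = np.array([2.0, 4.0, 4.0, 6.0])
mu_ml, sigma2_ml = gaussian_ml_estimate(x)

# Sample variance with the n - 1 normalization, for comparison.
sample_variance = np.var(x, ddof=1)
rescaled = sample_variance * (len(x) - 1) / len(x)
```

Here `rescaled` coincides with `sigma2_ml`: the two variance estimates differ only by the factor (n - 1)/n, which becomes negligible as n grows.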


Figure 9.11 displays the log-likelihood function corresponding to 50 iid samples from a Gaussian
distribution with $\mu := 3$ and $\sigma := 4$. It also shows the approximation to the true pdf obtained by
maximum likelihood. In Examples 9.6.3 and 9.6.4 the log-likelihood function is strictly concave. 
This means that the function has a unique maximum that can be located by setting the gradient 
to zero. When this yields nonlinear equations that cannot be solved directly, we can leverage 
optimization methods such as gradient ascent that will converge to the maximum. However, 
the log-likelihood function is not always concave. As illustrated by the following example, in 
such cases it can have multiple local maxima, which may make it intractable to compute the 
maximum-likelihood estimator. 


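Example 9.6.5 (Log-likelihood function of a Gaussian mixture). Let $X$ be a Gaussian mixture
defined as

$$X := \begin{cases} G_1 & \text{with probability } \frac{1}{5}, \\ G_2 & \text{with probability } \frac{4}{5}, \end{cases} \tag{9.77}$$

where $G_1$ is a Gaussian random variable with mean $-\mu$ and variance $\sigma^2$, whereas $G_2$ is also
Gaussian with mean $\mu$ and variance $\sigma^2$. We have parameterized the mixture with just two
parameters so that we can visualize the log-likelihood in two dimensions. Let $x_1, x_2, \ldots$ be data
modeled as iid samples from $X$. The likelihood function is equal to

\begin{align}
\mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) &= f_{\mu, \sigma}(x_1, \ldots, x_n) \tag{9.78} \\
&= \prod_{i=1}^{n} \left( \frac{1}{5\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i + \mu)^2}{2\sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right) \tag{9.79}
\end{align}

and the log-likelihood function to

$$\log \mathcal{L}_{x_1, \ldots, x_n}(\mu, \sigma) = \sum_{i=1}^{n} \log \left( \frac{1}{5\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i + \mu)^2}{2\sigma^2}} + \frac{4}{5\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right). \tag{9.80}$$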


Figure 9.12 shows the log-likelihood function for 40 iid samples of the distribution when $\mu := 4$
and $\sigma := 1$. The function has a local maximum away from the global maximum. This means
that if we use a local ascent method to find the ML estimator, we might not find the global
maximum, but remain stuck at the local maximum instead. The estimate corresponding to the
local maximum (shown on the left) has the same variance as the global maximum but $\mu$ is close
to $-4$ instead of 4. Although the estimate doesn't fit the data very well, it is locally optimal:
small shifts of $\mu$ and $\sigma$ yield worse fits (in terms of the likelihood).

△
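
To get a feel for the two competing parameter values, the sketch below evaluates the mixture log-likelihood at $(\mu, \sigma) = (4, 1)$ and at the flipped value $(-4, 1)$. It is written in Python with NumPy and assumes mixture weights of 1/5 for the component centered at $-\mu$ and 4/5 for the component centered at $\mu$; the data are simulated, so the numbers will not reproduce Figure 9.12.

```python
import numpy as np

def mixture_log_likelihood(x, mu, sigma):
    x = np.asarray(x, dtype=float)
    norm = 5.0 * np.sqrt(2.0 * np.pi) * sigma
    # Component with weight 1/5 centered at -mu.
    first = np.exp(-(x + mu) ** 2 / (2.0 * sigma ** 2)) / norm
    # Component with weight 4/5 centered at +mu.
    second = 4.0 * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / norm
    return float(np.sum(np.log(first + second)))

# Simulated data (hypothetical): 40 samples with mu = 4 and sigma = 1.
rng = np.random.default_rng(2)
n = 40
from_second = rng.random(n) < 0.8
x = np.where(from_second, rng.normal(4.0, 1.0, n), rng.normal(-4.0, 1.0, n))

ll_true = mixture_log_likelihood(x, 4.0, 1.0)
ll_flipped = mixture_log_likelihood(x, -4.0, 1.0)
```

The flipped parameters assign the large weight to the wrong cluster, so `ll_flipped` is lower than `ll_true`. A local ascent method started near $\mu = -4$ could nevertheless stay in that region.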


To finish this section, we describe a machine-learning algorithm for supervised learning based 
on parametric fitting using ML estimation. 

Example 9.6.6 (Quadratic discriminant analysis). Quadratic discriminant analysis is an algorithm
for supervised learning. The input to the algorithm consists of two sets of training data,
made up of $d$-dimensional vectors $\vec{a}_1, \ldots, \vec{a}_n$ and $\vec{b}_1, \ldots, \vec{b}_n$ which belong to two different classes (the
method can easily be extended to deal with more classes). The goal is to classify new instances
based on the structure of the data.

To perform quadratic discriminant analysis we first fit a $d$-dimensional Gaussian distribution
to the data of each class using the ML estimator for the mean and covariance matrix, which
correspond to the sample mean and covariance matrix of the training data (up to a slight
rescaling of the sample covariance). In more detail, $\vec{a}_1, \ldots, \vec{a}_n$ are used to estimate a mean $\vec{\mu}_a$
and covariance matrix $\Sigma_a$, whereas $\vec{b}_1, \ldots, \vec{b}_n$ are used to estimate $\vec{\mu}_b$ and $\Sigma_b$,

\begin{align}
\{\vec{\mu}_a, \Sigma_a\} &:= \arg\max_{\vec{\mu}, \Sigma} \mathcal{L}_{\vec{a}_1, \ldots, \vec{a}_n}(\vec{\mu}, \Sigma), \tag{9.81} \\
\{\vec{\mu}_b, \Sigma_b\} &:= \arg\max_{\vec{\mu}, \Sigma} \mathcal{L}_{\vec{b}_1, \ldots, \vec{b}_n}(\vec{\mu}, \Sigma). \tag{9.82}
\end{align}

Then for each new example $\vec{x}$, the value of the density function at the example for both classes
is evaluated. If

$$f_{\vec{\mu}_a, \Sigma_a}(\vec{x}) > f_{\vec{\mu}_b, \Sigma_b}(\vec{x}) \tag{9.83}$$

then $\vec{x}$ is declared to belong to the first class, otherwise it is declared to belong to the second
class. Figure 9.13 shows the results of applying the method to data simulated using two Gaussian
distributions.

△
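
The two steps of the method (fit one Gaussian per class, then compare densities) can be sketched as follows in Python with NumPy. The function names, the class labels "a" and "b", and the simulated training data are assumptions made for the example; log-densities are compared instead of densities, which yields the same decision.

```python
import numpy as np

def fit_gaussian_ml(points):
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    # ML covariance estimate: divide by n (not n - 1).
    cov = centered.T @ centered / len(points)
    return mean, cov

def gaussian_log_density(x, mean, cov):
    d = len(mean)
    diff = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def qda_classify(x, params_a, params_b):
    # Pick the class whose fitted Gaussian has the larger density at x.
    log_fa = gaussian_log_density(x, *params_a)
    log_fb = gaussian_log_density(x, *params_b)
    return "a" if log_fa > log_fb else "b"

# Simulated training data (hypothetical): two well-separated 2D Gaussians.
rng = np.random.default_rng(3)
a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=200)
b = rng.multivariate_normal([5.0, 5.0], [[2.0, 0.5], [0.5, 1.0]], size=200)
params_a, params_b = fit_gaussian_ml(a), fit_gaussian_ml(b)

label_near_a = qda_classify([0.2, -0.1], params_a, params_b)
label_near_b = qda_classify([4.8, 5.3], params_a, params_b)
```

Because each class has its own covariance matrix, the boundary between the two decision regions is a quadratic curve, which is where the method gets its name.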
Figure 9.13: Quadratic-discriminant analysis applied to data from two different classes (left). The data
corresponding to the two different classes are colored orange and blue. Three new examples are colored 
in black. Two bivariate Gaussians are fit to the data. Their contour lines are shown in the respective 
color of each class on the right. These distributions are used to classify the new examples, which are 
colored according to their estimated class. 


9.7 Proofs 

9.7.1 Proof of Lemma 9.2.5 

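We consider the sample variance of an iid sequence $\widetilde{X}$ with mean $\mu$ and variance $\sigma^2$,

\begin{align}
\widetilde{Y}(n) &:= \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}(i) - \operatorname{av}\left(\widetilde{X}(1), \ldots, \widetilde{X}(n)\right) \right)^2 \tag{9.84} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}(i) - \frac{1}{n} \sum_{j=1}^{n} \widetilde{X}(j) \right)^2 \tag{9.85} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left( \widetilde{X}(i)^2 + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} \widetilde{X}(j) \widetilde{X}(k) - \frac{2}{n} \sum_{j=1}^{n} \widetilde{X}(i) \widetilde{X}(j) \right). \tag{9.86}
\end{align}

To simplify notation, we denote the mean square $\mathrm{E}\left(\widetilde{X}(i)^2\right) = \mu^2 + \sigma^2$ by $\xi$. We have

\begin{align}
\mathrm{E}\left(\widetilde{Y}(n)\right) &= \frac{1}{n-1} \sum_{i=1}^{n} \Bigg( \mathrm{E}\left(\widetilde{X}(i)^2\right) + \frac{1}{n^2} \sum_{j=1}^{n} \mathrm{E}\left(\widetilde{X}(j)^2\right) + \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k \ne j} \mathrm{E}\left(\widetilde{X}(j) \widetilde{X}(k)\right) \tag{9.87} \\
&\qquad\qquad - \frac{2}{n} \mathrm{E}\left(\widetilde{X}(i)^2\right) - \frac{2}{n} \sum_{j \ne i} \mathrm{E}\left(\widetilde{X}(i) \widetilde{X}(j)\right) \Bigg) \tag{9.88} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \left( \xi + \frac{n\xi}{n^2} + \frac{n(n-1)\mu^2}{n^2} - \frac{2\xi}{n} - \frac{2(n-1)\mu^2}{n} \right) \tag{9.89} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \frac{n-1}{n} \left( \xi - \mu^2 \right) \tag{9.90} \\
&= \sigma^2. \tag{9.91}
\end{align}

9.7.2 Proof of Theorem 9.3.4

We denote the sample median by $\widetilde{Y}(n)$ and the median of the distribution by $\gamma$. Our aim is to show that for any $\epsilon > 0$

$$\lim_{n \to \infty} \mathrm{P}\left( \left| \widetilde{Y}(n) - \gamma \right| \ge \epsilon \right) = 0. \tag{9.92}$$

We will prove that

$$\lim_{n \to \infty} \mathrm{P}\left( \widetilde{Y}(n) \ge \gamma + \epsilon \right) = 0. \tag{9.93}$$

The same argument can be used to establish

$$\lim_{n \to \infty} \mathrm{P}\left( \widetilde{Y}(n) \le \gamma - \epsilon \right) = 0. \tag{9.94}$$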


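If we order the set $\{\widetilde{X}(1), \ldots, \widetilde{X}(n)\}$, then $\widetilde{Y}(n)$ equals the $(n+1)/2$th element if $n$ is odd
and the average of the $n/2$th and the $(n/2+1)$th element if $n$ is even. The event $\widetilde{Y}(n) \ge \gamma + \epsilon$
therefore implies that at least $(n+1)/2$ of the elements are greater than or equal to $\gamma + \epsilon$.

For each individual $\widetilde{X}(i)$, the probability that $\widetilde{X}(i) > \gamma + \epsilon$ is

$$p := 1 - F_{\widetilde{X}(i)}(\gamma + \epsilon) = 1/2 - \epsilon', \tag{9.95}$$

where we assume that $\epsilon' > 0$. If this is not the case then the cdf of the iid sequence is flat at $\gamma$ and
the median is not well defined. The number of random variables in the set $\{\widetilde{X}(1), \ldots, \widetilde{X}(n)\}$
which are larger than $\gamma + \epsilon$ is distributed as a binomial random variable $B_n$ with parameters $n$
and $p$. As a result, we have

\begin{align}
\mathrm{P}\left( \widetilde{Y}(n) \ge \gamma + \epsilon \right) &\le \mathrm{P}\left( \tfrac{n+1}{2} \text{ or more samples are greater than or equal to } \gamma + \epsilon \right) \tag{9.96} \\
&= \mathrm{P}\left( B_n \ge \frac{n+1}{2} \right) \tag{9.97} \\
&= \mathrm{P}\left( B_n - np \ge \frac{n+1}{2} - np \right) \tag{9.98} \\
&\le \mathrm{P}\left( \left| B_n - np \right| \ge n\epsilon' + \frac{1}{2} \right) \tag{9.99} \\
&\le \frac{\mathrm{Var}(B_n)}{\left( n\epsilon' + \frac{1}{2} \right)^2} \quad \text{by Chebyshev's inequality} \tag{9.100} \\
&= \frac{np(1-p)}{n^2 \left( \epsilon' + \frac{1}{2n} \right)^2} \tag{9.101} \\
&= \frac{p(1-p)}{n \left( \epsilon' + \frac{1}{2n} \right)^2}, \tag{9.102}
\end{align}

which converges to zero as $n \to \infty$. This establishes (9.93).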








Chapter 10 

Bayesian Statistics 


In the frequentist paradigm we model the data as realizations from a distribution that is fixed. In 
particular, if the model is parametric, the parameters are deterministic quantities. In contrast, 
in Bayesian parametric modeling the parameters are modeled as random variables. The goal is 
to have the flexibility to quantify our uncertainty about the underlying distribution beforehand, 
for example in order to integrate available prior information about the data. 

10.1 Bayesian parametric models 

In this section we describe how to fit a parametric model to a data set within a Bayesian
framework. As in Section 9.6, we assume that the data are generated by sampling from known
distributions with unknown parameters. The crucial difference is that we model the parameters
as being random instead of deterministic. This requires selecting their prior distribution before
fitting the data, which allows us to quantify our uncertainty about the value of the parameters
beforehand. A Bayesian parametric model is specified by:

1. The prior distribution is the distribution of $\vec{\Theta}$, which encodes our uncertainty about the
model before seeing the data.

2. The likelihood is the conditional distribution of $\vec{X}$ given $\vec{\Theta}$, which specifies how the data
depend on the parameters. In contrast to the frequentist framework, the likelihood is not
interpreted as a deterministic function of the parameters.

Our goal when learning a Bayesian model is to compute the posterior distribution of the
parameters $\vec{\Theta}$ given $\vec{X}$. Evaluating this posterior distribution at the realization $\vec{x}$ allows us to
update our uncertainty about $\vec{\Theta}$ using the data.

The following example fits a Bayesian model to iid samples from a Bernoulli random variable. 

Example 10.1.1 (Bernoulli distribution). Let $\vec{x}$ be a vector of data that we wish to model as
iid samples from a Bernoulli distribution. Since we are taking a Bayesian approach we choose
a prior distribution for the parameter of the Bernoulli. We will consider two different Bayesian
estimators $\Theta_1$ and $\Theta_2$:
Figure 10.1: The prior distributions of $\Theta_1$ (blue) and $\Theta_2$ (dark red) in Example 10.1.1 are shown in the
top-left graph. The rest of the graphs show the corresponding posterior distributions for different data
sets ($n_0 = 1$, $n_1 = 3$; $n_0 = 3$, $n_1 = 1$; $n_0 = 91$, $n_1 = 9$).


1. $\Theta_1$ represents a conservative estimator in terms of prior information. We assign a uniform
pdf to the parameter. Any value in the unit interval has the same probability density:

$$f_{\Theta_1}(\theta) = \begin{cases} 1 & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases} \tag{10.1}$$

2. $\Theta_2$ is an estimator that assumes that the parameter is closer to 1 than to 0. We could use
it for instance to capture the suspicion that a coin is biased towards heads. We choose a
skewed pdf that increases linearly as $\theta$ goes from zero to one,

$$f_{\Theta_2}(\theta) = \begin{cases} 2\theta & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases} \tag{10.2}$$
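By the iid assumption, the likelihood, which is just the conditional pmf of the data given the
parameter of the Bernoulli, equals

$$p_{\vec{X} | \Theta}(\vec{x} \,|\, \theta) = \theta^{n_1} (1 - \theta)^{n_0}, \tag{10.3}$$

where $n_1$ is the number of ones in the data and $n_0$ the number of zeros (see Example 9.6.3).
The posterior pdfs of the two estimators are consequently equal to

\begin{align}
f_{\Theta_1 | \vec{X}}(\theta \,|\, \vec{x}) &= \frac{f_{\Theta_1}(\theta) \, p_{\vec{X} | \Theta_1}(\vec{x} \,|\, \theta)}{p_{\vec{X}}(\vec{x})} \tag{10.4} \\
&= \frac{f_{\Theta_1}(\theta) \, p_{\vec{X} | \Theta_1}(\vec{x} \,|\, \theta)}{\int_u f_{\Theta_1}(u) \, p_{\vec{X} | \Theta_1}(\vec{x} \,|\, u) \, \mathrm{d}u} \tag{10.5} \\
&= \frac{\theta^{n_1} (1 - \theta)^{n_0}}{\int_u u^{n_1} (1 - u)^{n_0} \, \mathrm{d}u} \tag{10.6} \\
&= \frac{\theta^{n_1} (1 - \theta)^{n_0}}{\beta(n_1 + 1, n_0 + 1)}, \tag{10.7} \\
f_{\Theta_2 | \vec{X}}(\theta \,|\, \vec{x}) &= \frac{f_{\Theta_2}(\theta) \, p_{\vec{X} | \Theta_2}(\vec{x} \,|\, \theta)}{p_{\vec{X}}(\vec{x})} \tag{10.8} \\
&= \frac{f_{\Theta_2}(\theta) \, p_{\vec{X} | \Theta_2}(\vec{x} \,|\, \theta)}{\int_u f_{\Theta_2}(u) \, p_{\vec{X} | \Theta_2}(\vec{x} \,|\, u) \, \mathrm{d}u} \tag{10.9} \\
&= \frac{\theta^{n_1 + 1} (1 - \theta)^{n_0}}{\int_u u^{n_1 + 1} (1 - u)^{n_0} \, \mathrm{d}u} \tag{10.10} \\
&= \frac{\theta^{n_1 + 1} (1 - \theta)^{n_0}}{\beta(n_1 + 2, n_0 + 1)}, \tag{10.11}
\end{align}

where

$$\beta(a, b) := \int_0^1 u^{a-1} (1 - u)^{b-1} \, \mathrm{d}u \tag{10.12}$$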


is a special function called the beta function or Euler integral of the first kind, which is tabulated.

Figure 10.1 shows the plot of the posterior distribution for different values of $n_1$ and $n_0$. It
also shows the maximum-likelihood estimator of the parameter, which is just $n_1 / (n_0 + n_1)$ (see
Example 9.6.3). For a small number of flips, the posterior pdf of $\Theta_2$ is skewed to the right with
respect to that of $\Theta_1$, reflecting the prior belief that the parameter is closer to 1. However, for
a large number of flips both posterior densities are very close.

△


10.2 Conjugate prior 

Both posterior distributions in Example 10.1.1 are beta distributions (see Definition 2.3.12), and
so are the priors. The uniform prior of $\Theta_1$ is beta with parameters $a = 1$ and $b = 1$, whereas the
skewed prior of $\Theta_2$ is a beta distribution with parameters $a = 2$ and $b = 1$. Since the prior and
the posterior belong to the same family, computing the posterior is equivalent to just updating
the parameters. When the prior and posterior are guaranteed to belong to the same family of
distributions for a particular likelihood, the distributions are called conjugate priors.
Definition 10.2.1 (Conjugate priors). A conjugate family of distributions for a certain likelihood
satisfies the following property: if the prior belongs to the family, then the posterior also
belongs to the family.


Beta distributions are conjugate priors when the likelihood is binomial.

Theorem 10.2.2 (The beta distribution is conjugate to the binomial likelihood). If the prior
distribution of $\Theta$ is a beta distribution with parameters $a$ and $b$ and the likelihood of the data
$X$ given $\Theta = \theta$ is binomial with parameters $n$ and $\theta$, then the posterior distribution of $\Theta$ given $X = x$ is
a beta distribution with parameters $x + a$ and $n - x + b$.


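Proof.

\begin{align}
f_{\Theta | X}(\theta \,|\, x) &= \frac{f_{\Theta}(\theta) \, p_{X | \Theta}(x \,|\, \theta)}{p_X(x)} \tag{10.13} \\
&= \frac{f_{\Theta}(\theta) \, p_{X | \Theta}(x \,|\, \theta)}{\int_u f_{\Theta}(u) \, p_{X | \Theta}(x \,|\, u) \, \mathrm{d}u} \tag{10.14} \\
&= \frac{\theta^{a-1} (1 - \theta)^{b-1} \binom{n}{x} \theta^{x} (1 - \theta)^{n-x}}{\int_u u^{a-1} (1 - u)^{b-1} \binom{n}{x} u^{x} (1 - u)^{n-x} \, \mathrm{d}u} \tag{10.15} \\
&= \frac{\theta^{x+a-1} (1 - \theta)^{n-x+b-1}}{\int_u u^{x+a-1} (1 - u)^{n-x+b-1} \, \mathrm{d}u} \tag{10.16} \\
&= f_{\beta}(\theta; x + a, n - x + b). \tag{10.17}
\end{align}

□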


Note that the posteriors obtained in Example 10.1.1 follow immediately from the theorem. 

Example 10.2.3 (Poll in New Mexico). In a poll in New Mexico for the 2016 US election with 429
participants, 227 people intended to vote for Clinton and 202 for Trump (the data are from a real
poll 1 , but for simplicity we are ignoring the other candidates and people that were undecided).
Our aim is to use a Bayesian framework to predict the outcome of the election in New Mexico
using these data.

We model the fraction of people that vote for Trump as a random variable $\Theta$. We assume that
the $n$ people in the poll are chosen uniformly at random with replacement from the population,
so given $\Theta = \theta$ the number of Trump voters is a binomial with parameters $n$ and $\theta$. We don't
have any additional information about the possible value of $\Theta$, so we assume it is uniform or
equivalently a beta distribution with parameters $a := 1$ and $b := 1$.

By Theorem 10.2.2 the posterior distribution of $\Theta$ given the data that we observe is a beta
distribution with parameters $a := 203$ and $b := 228$, depicted in Figure 10.2. The corresponding
probability that $\Theta > 0.5$ is 11.4%, which is our estimate for the probability that Trump wins in
New Mexico.

△
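
The posterior update and the resulting tail probability can be reproduced numerically. The sketch below is written in Python with NumPy and the standard-library `math` module; the helper names are hypothetical, and the probability is approximated with a simple midpoint Riemann sum rather than a library routine.

```python
import math
import numpy as np

def beta_posterior_parameters(a, b, x, n):
    # Beta(a, b) prior and x successes in n binomial trials give Beta(x + a, n - x + b).
    return x + a, n - x + b

def beta_pdf(theta, a, b):
    # Work with logarithms to avoid overflow in the beta function for large a, b.
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return np.exp((a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta) - log_norm)

# Uniform prior (a = b = 1); 202 of the 429 participants favor Trump.
a_post, b_post = beta_posterior_parameters(a=1, b=1, x=202, n=429)

# Approximate P(Theta > 0.5) with a midpoint Riemann sum over (0.5, 1).
left_edges = np.linspace(0.5, 1.0, 200_001)[:-1]
step = left_edges[1] - left_edges[0]
midpoints = left_edges + step / 2
prob_trump_wins = float(np.sum(beta_pdf(midpoints, a_post, b_post)) * step)
```

The computed value of `prob_trump_wins` should be close to the 11.4% quoted in the example, up to the discretization error of the sum.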

1 The poll results are taken from
https://www.abqjournal.com/883092/clinton-still-ahead-in-new-mexico.html
Figure 10.2: Posterior distribution of the fraction of Trump voters in New Mexico conditioned on the
poll data in Example 10.2.3. 


10.3 Bayesian estimators 

The Bayesian approach to learning probabilistic models yields the whole posterior distribution 
of the parameters of interest. In this section we describe two alternatives for deriving a single 
estimate of the parameters from the posterior distribution. 

10.3.1 Minimum mean-square-error estimation 

The mean of the posterior distribution is the conditional expectation of the parameters given 
the data. Choosing the posterior mean as an estimator for the parameters $\vec{\Theta}$ has a strong
theoretical justification: it is guaranteed to achieve the minimum mean square error (MSE) 
among all possible estimators. Of course, this only holds if all of the assumptions hold, i.e. the 
parameters are generated according to the prior and the data are then generated according to 
the likelihood, which may not be the case for real data. 

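Theorem 10.3.1 (The posterior mean minimizes the MSE). The posterior mean is the minimum
mean-square-error (MMSE) estimate of the parameter $\vec{\Theta}$ given the data $\vec{X}$. To be more
precise, let us define

$$\theta_{\text{MMSE}}(\vec{x}) := \mathrm{E}\left( \vec{\Theta} \,|\, \vec{X} = \vec{x} \right). \tag{10.18}$$

For any arbitrary estimator $\theta_{\text{other}}(\vec{x})$,

$$\mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \vec{\Theta} \right)^2 \right) \ge \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \right). \tag{10.19}$$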

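Proof. We begin by computing the MSE of the arbitrary estimator conditioned on $\vec{X} = \vec{x}$ in
terms of the posterior mean $\theta_{\text{MMSE}}(\vec{x})$,

\begin{align}
&\mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} = \vec{x} \right) \tag{10.20} \\
&\quad = \mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \theta_{\text{MMSE}}(\vec{X}) + \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} = \vec{x} \right) \tag{10.21} \\
&\quad = \left( \theta_{\text{other}}(\vec{x}) - \theta_{\text{MMSE}}(\vec{x}) \right)^2 + \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} = \vec{x} \right) \notag \\
&\qquad + 2 \left( \theta_{\text{other}}(\vec{x}) - \theta_{\text{MMSE}}(\vec{x}) \right) \left( \theta_{\text{MMSE}}(\vec{x}) - \mathrm{E}\left( \vec{\Theta} \,|\, \vec{X} = \vec{x} \right) \right) \tag{10.22} \\
&\quad = \left( \theta_{\text{other}}(\vec{x}) - \theta_{\text{MMSE}}(\vec{x}) \right)^2 + \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} = \vec{x} \right). \tag{10.23}
\end{align}

By iterated expectation,

\begin{align}
\mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \vec{\Theta} \right)^2 \right) &= \mathrm{E}\left( \mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} \right) \right) \tag{10.24} \\
&= \mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \theta_{\text{MMSE}}(\vec{X}) \right)^2 \right) + \mathrm{E}\left( \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \,\Big|\, \vec{X} \right) \right) \notag \\
&= \mathrm{E}\left( \left( \theta_{\text{other}}(\vec{X}) - \theta_{\text{MMSE}}(\vec{X}) \right)^2 \right) + \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \right) \tag{10.25} \\
&\ge \mathrm{E}\left( \left( \theta_{\text{MMSE}}(\vec{X}) - \vec{\Theta} \right)^2 \right), \tag{10.26}
\end{align}

since the expectation of a nonnegative quantity is nonnegative.

□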


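Example 10.3.2 (Bernoulli distribution (continued)). In order to obtain point estimates for
the parameter in Example 10.1.1 we compute the posterior means:

\begin{align}
\mathrm{E}\left( \Theta_1 \,|\, \vec{X} = \vec{x} \right) &= \int_0^1 \theta \, f_{\Theta_1 | \vec{X}}(\theta \,|\, \vec{x}) \, \mathrm{d}\theta \tag{10.27} \\
&= \frac{\int_0^1 \theta^{n_1 + 1} (1 - \theta)^{n_0} \, \mathrm{d}\theta}{\beta(n_1 + 1, n_0 + 1)} \tag{10.28} \\
&= \frac{\beta(n_1 + 2, n_0 + 1)}{\beta(n_1 + 1, n_0 + 1)}, \tag{10.29} \\
\mathrm{E}\left( \Theta_2 \,|\, \vec{X} = \vec{x} \right) &= \int_0^1 \theta \, f_{\Theta_2 | \vec{X}}(\theta \,|\, \vec{x}) \, \mathrm{d}\theta \tag{10.30} \\
&= \frac{\beta(n_1 + 3, n_0 + 1)}{\beta(n_1 + 2, n_0 + 1)}. \tag{10.31}
\end{align}

Figure 10.1 shows the posterior means for different values of $n_0$ and $n_1$.

△

10.3.2 Maximum-a-posteriori estimation

An alternative to the posterior mean is the posterior mode, which is the maximum of the pdf
or the pmf of the posterior distribution.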
Definition 10.3.3 (Maximum-a-posteriori estimator). The maximum-a-posteriori (MAP) estimator
of a parameter $\vec{\Theta}$ given data $\vec{x}$ modeled as a realization of a random vector $\vec{X}$ is

$$\theta_{\text{MAP}}(\vec{x}) := \arg\max_{\vec{\theta}} \; p_{\vec{\Theta} | \vec{X}}\left( \vec{\theta} \,|\, \vec{x} \right) \tag{10.32}$$

if $\vec{\Theta}$ is modeled as a discrete random variable and

$$\theta_{\text{MAP}}(\vec{x}) := \arg\max_{\vec{\theta}} \; f_{\vec{\Theta} | \vec{X}}\left( \vec{\theta} \,|\, \vec{x} \right) \tag{10.33}$$

if it is modeled as a continuous random variable.

In Figure 10.1 the ML estimator of $\Theta$ is the mode (maximum value) of the posterior distribution
when the prior is uniform. This is not a coincidence: under a uniform prior the MAP and ML
estimates are the same.

Lemma 10.3.4. The maximum-likelihood estimator of a parameter $\vec{\Theta}$ is the mode (maximum
value) of the pdf of the posterior distribution given the data $\vec{X}$ if its prior distribution is uniform.

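Proof. We prove the result when the model for the data and the parameters is continuous; if
either or both of them are discrete the proof is identical (in that case the ML estimator is the
mode of the pmf of the posterior). If the prior distribution of the parameters is uniform, then
$f_{\vec{\Theta}}(\vec{\theta})$ is constant for any $\vec{\theta}$, which implies

\begin{align}
\arg\max_{\vec{\theta}} \; f_{\vec{\Theta} | \vec{X}}\left( \vec{\theta} \,|\, \vec{x} \right) &= \arg\max_{\vec{\theta}} \; \frac{f_{\vec{\Theta}}(\vec{\theta}) \, f_{\vec{X} | \vec{\Theta}}(\vec{x} \,|\, \vec{\theta})}{\int_u f_{\vec{\Theta}}(u) \, f_{\vec{X} | \vec{\Theta}}(\vec{x} \,|\, u) \, \mathrm{d}u} \tag{10.34} \\
&= \arg\max_{\vec{\theta}} \; f_{\vec{X} | \vec{\Theta}}(\vec{x} \,|\, \vec{\theta}) \quad \text{(the rest of the terms do not depend on } \vec{\theta}) \notag \\
&= \arg\max_{\vec{\theta}} \; \mathcal{L}_{\vec{x}}(\vec{\theta}). \tag{10.35}
\end{align}

□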


Note that uniform priors are only well defined in situations where the parameter is restricted to 
a bounded set. 

We now describe a situation in which the MAP estimator is optimal. If the parameter $\vec{\Theta}$ can
only take a discrete set of values, then the MAP estimator minimizes the probability of making
the wrong choice.

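Theorem 10.3.5 (MAP estimator minimizes the probability of error). Let $\vec{\Theta}$ be a discrete
random vector and let $\vec{X}$ be a random vector modeling the data. We define

$$\theta_{\text{MAP}}(\vec{x}) := \arg\max_{\vec{\theta}} \; p_{\vec{\Theta} | \vec{X}}\left( \vec{\theta} \,|\, \vec{X} = \vec{x} \right). \tag{10.36}$$

For any arbitrary estimator $\theta_{\text{other}}(\vec{x})$,

$$\mathrm{P}\left( \theta_{\text{other}}(\vec{X}) \ne \vec{\Theta} \right) \ge \mathrm{P}\left( \theta_{\text{MAP}}(\vec{X}) \ne \vec{\Theta} \right). \tag{10.37}$$

In words, the MAP estimator minimizes the probability of error.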
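Proof. We assume that $\vec{X}$ is a continuous random vector, but the same argument applies if it is
discrete. We have

\begin{align}
\mathrm{P}\left( \vec{\Theta} = \theta_{\text{other}}(\vec{X}) \right) &= \int_{\vec{x}} f_{\vec{X}}(\vec{x}) \, \mathrm{P}\left( \vec{\Theta} = \theta_{\text{other}}(\vec{x}) \,|\, \vec{X} = \vec{x} \right) \mathrm{d}\vec{x} \tag{10.38} \\
&= \int_{\vec{x}} f_{\vec{X}}(\vec{x}) \, p_{\vec{\Theta} | \vec{X}}\left( \theta_{\text{other}}(\vec{x}) \,|\, \vec{x} \right) \mathrm{d}\vec{x} \tag{10.39} \\
&\le \int_{\vec{x}} f_{\vec{X}}(\vec{x}) \, p_{\vec{\Theta} | \vec{X}}\left( \theta_{\text{MAP}}(\vec{x}) \,|\, \vec{x} \right) \mathrm{d}\vec{x} \tag{10.40} \\
&= \mathrm{P}\left( \vec{\Theta} = \theta_{\text{MAP}}(\vec{X}) \right), \tag{10.41}
\end{align}

where (10.40) follows from the definition of the MAP estimator as the mode of the posterior. □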

Example 10.3.6 (Sending bits). We consider a very simple model for a communication channel
in which we aim to send a signal $\Theta$ consisting of a single bit. Our prior knowledge indicates that
the signal is equal to one with probability 1/4,

$$p_{\Theta}(1) = \frac{1}{4}, \qquad p_{\Theta}(0) = \frac{3}{4}. \tag{10.42}$$

Due to the presence of noise in the channel, we send the signal n times. At the receiver we
observe

$$\vec{X}_i = \Theta + \vec{Z}_i, \quad 1 \leq i \leq n, \tag{10.43}$$

where $\vec{Z}$ contains n iid standard Gaussian random variables. Modeling perturbations as Gaussian
is a popular choice in communications. It is justified by the central limit theorem, under
the assumption that the noise is a combination of many small effects that are approximately
independent.

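We will now compute and compare the ML and MAP estimators of $\Theta$ given the observations.

The likelihood is equal to

\begin{align}
\mathcal{L}_{\vec{x}}(\theta) &= \prod_{i=1}^{n} f_{\vec{X}_i \mid \Theta}(\vec{x}_i \mid \theta) \tag{10.44} \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(\vec{x}_i - \theta)^2}{2}}. \tag{10.45}
\end{align}

It is easier to deal with the log-likelihood function,

$$\log \mathcal{L}_{\vec{x}}(\theta) = -\sum_{i=1}^{n} \frac{(\vec{x}_i - \theta)^2}{2} - \frac{n}{2} \log 2\pi. \tag{10.46}$$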

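Since $\Theta$ only takes two values, we can compare directly. We will choose $\theta_{\text{ML}}(\vec{x}) = 1$ if

\begin{align}
\log \mathcal{L}_{\vec{x}}(1) &= -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i + 1}{2} - \frac{n}{2} \log 2\pi \tag{10.47} \\
&\geq -\sum_{i=1}^{n} \frac{\vec{x}_i^2}{2} - \frac{n}{2} \log 2\pi \tag{10.48} \\
&= \log \mathcal{L}_{\vec{x}}(0). \tag{10.49}
\end{align}

Equivalently,

$$\theta_{\text{ML}}(\vec{x}) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i > \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \tag{10.50}$$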


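The rule makes a lot of sense: if the sample mean of the data is closer to 1 than to 0 then our
estimate is equal to 1. By the law of total probability, the probability of error of this estimator
is equal to

\begin{align}
P\left(\Theta \neq \theta_{\text{ML}}(\vec{X})\right)
&= P\left(\Theta \neq \theta_{\text{ML}}(\vec{X}) \mid \Theta = 0\right) P(\Theta = 0) + P\left(\Theta \neq \theta_{\text{ML}}(\vec{X}) \mid \Theta = 1\right) P(\Theta = 1) \notag \\
&= P\left(\frac{1}{n} \sum_{i=1}^{n} \vec{X}_i > \frac{1}{2} \,\Big|\, \Theta = 0\right) P(\Theta = 0) + P\left(\frac{1}{n} \sum_{i=1}^{n} \vec{X}_i < \frac{1}{2} \,\Big|\, \Theta = 1\right) P(\Theta = 1) \notag \\
&= Q\left(\sqrt{n}/2\right), \tag{10.51}
\end{align}

where the last equality follows from the fact that if we condition on $\Theta = \theta$ the empirical mean
is Gaussian with variance $\sigma^2/n$ and mean $\theta$ (see the proof of Theorem 6.2.2); here $\sigma^2 = 1$.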

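To compute the MAP estimate we must find the maximum of the posterior pmf of $\Theta$ given the
observed data. Equivalently, we find the maximum of its logarithm (this is equivalent because
the logarithm is a monotone function),

\begin{align}
\log p_{\Theta \mid \vec{X}}(\theta \mid \vec{x})
&= \log \frac{\prod_{i=1}^{n} f_{\vec{X}_i \mid \Theta}(\vec{x}_i \mid \theta) \, p_{\Theta}(\theta)}{f_{\vec{X}}(\vec{x})} \tag{10.52} \\
&= \sum_{i=1}^{n} \log f_{\vec{X}_i \mid \Theta}(\vec{x}_i \mid \theta) + \log p_{\Theta}(\theta) - \log f_{\vec{X}}(\vec{x}) \tag{10.53} \\
&= -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i \theta + \theta^2}{2} - \frac{n}{2} \log 2\pi + \log p_{\Theta}(\theta) - \log f_{\vec{X}}(\vec{x}). \tag{10.54}
\end{align}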


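We compare the value of this function for the two possible values of $\Theta$: 0 and 1. We choose
$\theta_{\text{MAP}}(\vec{x}) = 1$ if

\begin{align}
\log p_{\Theta \mid \vec{X}}(1 \mid \vec{x}) + \log f_{\vec{X}}(\vec{x})
&= -\sum_{i=1}^{n} \frac{\vec{x}_i^2 - 2\vec{x}_i + 1}{2} - \frac{n}{2} \log 2\pi - \log 4 \tag{10.55} \\
&\geq -\sum_{i=1}^{n} \frac{\vec{x}_i^2}{2} - \frac{n}{2} \log 2\pi - \log 4 + \log 3 \tag{10.56} \\
&= \log p_{\Theta \mid \vec{X}}(0 \mid \vec{x}) + \log f_{\vec{X}}(\vec{x}). \tag{10.57}
\end{align}

Equivalently,

$$\theta_{\text{MAP}}(\vec{x}) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^{n} \vec{x}_i > \frac{1}{2} + \frac{\log 3}{n}, \\ 0 & \text{otherwise.} \end{cases} \tag{10.58}$$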
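The two decision rules (10.50) and (10.58) can be sketched in a few lines of Python. This is only an illustration: the function names and the sample observations are hypothetical, and the generalization of the shift to an arbitrary prior, log(p(0)/p(1))/n, is an assumption obtained by repeating the derivation above with a generic prior (it reduces to log(3)/n for the prior of the example).

```python
import math

def theta_ml(xs):
    """ML rule (10.50): decide 1 if the sample mean exceeds 1/2, else 0."""
    return 1 if sum(xs) / len(xs) > 0.5 else 0

def theta_map(xs, prior_one=0.25):
    """MAP rule (10.58): same comparison with a shifted threshold.

    prior_one is P(Theta = 1); for 1/4 the shift equals log(3)/n.
    """
    n = len(xs)
    shift = math.log((1.0 - prior_one) / prior_one) / n
    return 1 if sum(xs) / n > 0.5 + shift else 0

xs = [0.6, 0.7]        # hypothetical observations with sample mean 0.65
print(theta_ml(xs))    # the mean is above 1/2
print(theta_map(xs))   # the threshold is 1/2 + log(3)/2, above the mean
```

With only two observations the prior dominates and the two rules disagree; with many observations at the same sample mean the shift becomes negligible and they agree.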


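The MAP estimate shifts the threshold with respect to the ML estimate to take into account
that $\Theta$ is more prone to equal zero. However, the correction term tends to zero as we gather
more evidence, so if a lot of data is available the two estimators will be very similar.

The probability of error of the MAP estimator is equal to

\begin{align}
P\left(\Theta \neq \theta_{\text{MAP}}(\vec{X})\right)
&= P\left(\Theta \neq \theta_{\text{MAP}}(\vec{X}) \mid \Theta = 0\right) P(\Theta = 0) + P\left(\Theta \neq \theta_{\text{MAP}}(\vec{X}) \mid \Theta = 1\right) P(\Theta = 1) \notag \\
&= P\left(\frac{1}{n} \sum_{i=1}^{n} \vec{X}_i > \frac{1}{2} + \frac{\log 3}{n} \,\Big|\, \Theta = 0\right) P(\Theta = 0)
 + P\left(\frac{1}{n} \sum_{i=1}^{n} \vec{X}_i < \frac{1}{2} + \frac{\log 3}{n} \,\Big|\, \Theta = 1\right) P(\Theta = 1) \tag{10.59} \\
&= \frac{3}{4} \, Q\left(\sqrt{n}/2 + \frac{\log 3}{\sqrt{n}}\right) + \frac{1}{4} \, Q\left(\sqrt{n}/2 - \frac{\log 3}{\sqrt{n}}\right). \tag{10.60}
\end{align}

Figure 10.3: Probability of error of the ML and MAP estimators in Example 10.3.6 for different values
of n.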


We compare the probability of error of the ML and MAP estimators in Figure 10.3. MAP 
estimation results in better performance, but the difference becomes small as n increases. 
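As a rough numerical check of this comparison, the following sketch evaluates the error probabilities (10.51) and (10.60) for a few values of n. It assumes the standard identity Q(x) = erfc(x / sqrt(2)) / 2 for the Gaussian tail; the function names are illustrative and are not taken from the text.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard Gaussian Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_ml(n):
    """Probability of error of the ML estimator, equation (10.51)."""
    return q_function(math.sqrt(n) / 2.0)

def error_map(n):
    """Probability of error of the MAP estimator, equation (10.60)."""
    shift = math.log(3.0) / math.sqrt(n)
    return (0.75 * q_function(math.sqrt(n) / 2.0 + shift)
            + 0.25 * q_function(math.sqrt(n) / 2.0 - shift))

for n in (1, 5, 10, 20):
    print(n, round(error_ml(n), 4), round(error_map(n), 4))
```

The MAP error should never exceed the ML error, in agreement with Theorem 10.3.5, and the gap should shrink as n grows.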


△











Chapter 11 

Hypothesis testing 


In a medical study we observe that 10% of the women and 12.5% of the men suffer from heart 
disease. If there are 20 people in the study, we would probably be hesitant to declare that 
women are less prone to suffer from heart disease than men; it is very possible that the results 
occurred by chance. However, if there are 20,000 people in the study, then it seems more likely 
that we are observing a real phenomenon. Hypothesis testing makes this intuition precise; it is 
a framework that allows us to decide whether patterns that we observe in our data are likely to 
be the result of random fluctuations or not. 

11.1 The hypothesis-testing framework 

The aim of hypothesis testing is to evaluate a predefined conjecture. In the example above,
this could be that heart disease is more prevalent in men than in women. The hypothesis that
our conjecture is false is called the null hypothesis, denoted by $H_0$. In our example, the null
hypothesis would be that heart disease is at least as prevalent in women as in men. If the null
hypothesis holds, then whatever pattern we are detecting in our data that seems to support our
conjecture is just a fluke: there just happen to be a lot of men with heart disease (or women
without) in the study. In contrast, the hypothesis under which our conjecture is true is known as
the alternative hypothesis, denoted by $H_1$. In this chapter we take a frequentist perspective:
the hypotheses either hold or do not hold; they are not modeled probabilistically.

A test is a procedure to determine whether we should reject the null hypothesis or not based on
the data. Rejecting the null hypothesis means that we consider it unlikely to hold, which
is evidence in favor of the alternative hypothesis. If we fail to reject the null hypothesis, this
does not mean that we consider it likely; we just don’t have enough information to discard it.
Most tests produce a decision by thresholding a test statistic, which is a function that maps
the data (i.e. a vector in $\mathbb{R}^n$) to a single number. The test rejects the null hypothesis if the test
statistic belongs to a rejection region $\mathcal{R}$. For example, we could have

$$\mathcal{R} := \{ t \mid t \geq \eta \}, \tag{11.1}$$

where t is the test statistic computed from the data and $\eta$ is a predefined threshold. In this
case, we would reject the null hypothesis only if t is at least $\eta$.

As shown in Table 11.1, there are two possible errors that we can make. A Type I error is
a false positive: our conjecture is false, but we reject the null hypothesis. A Type II error is
a false negative: our conjecture holds, but we do not reject the null hypothesis. In hypothesis
testing, our priority is to control Type I errors. When you read in a study that a result is
statistically significant at a level of 0.05, this means that the probability of committing a
Type I error is bounded by 5%.

                      Reject H_0?
                      No                  Yes
    H_0 is true       correct decision    Type I error
    H_1 is true       Type II error       correct decision

Table 11.1: Type I and II errors.

Definition 11.1.1 (Significance level and size). The size of a test is the probability of making 
a Type I error. The significance level of a test is an upper bound on the size. 

Rejecting the null hypothesis does not give a quantitative sense of the extent to which the data 
are incompatible with the null hypothesis. The p value is a function of the data that plays this 
role. 

Definition 11.1.2 (p value). The p value is the smallest significance level at which we would 
reject the null hypothesis for the data we observe. 

For a fixed significance level, it is desirable to select a test that minimizes the probability of 
making a Type II error. Equivalently, we would like to maximize the probability of rejecting 
the null hypothesis when it does not hold. This probability is known as the power of the test. 

Definition 11.1.3 (Power). The power of a test is the probability of rejecting the null hypothesis 
if it does not hold. 

Note that in order to characterize the power of a test we need to know the distribution of 
the data under the alternative hypothesis, which is often unrealistic (recall that the alternative 
hypothesis is just the complement of the null hypothesis and consequently encompasses many 
different possibilities). 

The standard procedure to apply hypothesis testing in the applied sciences is the following: 

1. Choose a conjecture.

2. Determine the corresponding null hypothesis.

3. Choose a test.

4. Gather the data.

5. Compute the test statistic from the data.

6. Compute the p value and reject the null hypothesis if it is below a predefined limit (typically
1% or 5%).

Example 11.1.4 (Clutch). We want to test the conjecture that a certain player in the NBA
is clutch, i.e. that he scores more points at the end of close games than during the rest of the
game. The null hypothesis is that there is no difference in his performance. The test statistic t
that we choose is the number of games in which he scores more points per minute in the last
quarter than in the rest of the game,

$$t(\vec{x}) = \sum_{i=1}^{n} 1_{\vec{x}_i > 0}, \tag{11.2}$$

where $\vec{x}_i$ is the difference between the points per minute he scores in the 4th quarter and in the
rest of the quarters of game i for $1 \leq i \leq n$.

The rejection region of the test is of the form

$$\mathcal{R} := \{ t(\vec{x}) \mid t(\vec{x}) \geq \eta \}, \tag{11.3}$$

for a fixed threshold $\eta$. Under the null hypothesis the probability of scoring more points per
minute in the 4th quarter is 1/2 (for simplicity we ignore the possibility that he scores the same
number of points), so we can model the test statistic under the null hypothesis as a binomial
random variable $T_0$ with parameters n and 1/2. If $\eta$ is an integer between 0 and n, then the
probability that the test statistic is in the rejection region if the null hypothesis holds is

$$P(T_0 \geq \eta) = \frac{1}{2^n} \sum_{k=\eta}^{n} \binom{n}{k}. \tag{11.4}$$

So the size of the test is $\frac{1}{2^n} \sum_{k=\eta}^{n} \binom{n}{k}$. Table 11.2 shows this value for all possible values of $\eta$
when n = 20. If we want a significance level of 1% or 5% then we need to set the threshold at
$\eta = 16$ or $\eta = 15$ respectively.

We gather the data from 20 games x and compute the value of the test statistic t (x) (note that 
we use a lowercase letter because it is a specific realization), which turns out to be 14 (he scores 
more points per minute in the fourth quarter in 14 of the games). This is not enough to reject 
the null hypothesis for our predefined level of 1% or 5%. Therefore the result is not statistically 
significant. 

In any case, we compute the p value, which is the smallest level at which the result would have 
been significant. From the table it is equal to 0.058. Note that under a frequentist framework we 
cannot interpret this as the probability that the null hypothesis holds (i.e. that the player is not 
better in the fourth quarter) because the hypothesis is not random, it either holds or it doesn’t. 
Our result is almost significant and although we do not have enough evidence to support our 
conjecture, it does seem plausible that the player performs better in the fourth quarter. 
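The size of the test in (11.4) and the p value quoted in this example can be reproduced with a short computation. The sketch below assumes Python 3.8 or later (for `math.comb`); the function name `size_of_test` is illustrative.

```python
from math import comb

def size_of_test(n, eta):
    """P(T_0 >= eta) for T_0 binomial with parameters n and 1/2, as in (11.4)."""
    return sum(comb(n, k) for k in range(eta, n + 1)) / 2 ** n

n = 20
for eta in range(1, n + 1):
    print(eta, f"{size_of_test(n, eta):.3f}")

# p value for the observed statistic t(x) = 14: the size of the test
# whose threshold equals the observed value.
p_value = size_of_test(n, 14)
print(f"p value: {p_value:.3f}")
```

The printed values should agree with the entries of Table 11.2 after rounding to three decimal places.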

△

    η             1      2      3      4      5      6      7      8      9      10
    P(T_0 ≥ η)    1.000  1.000  1.000  0.999  0.994  0.979  0.942  0.868  0.748  0.588

    η             11     12     13     14     15     16     17     18     19     20
    P(T_0 ≥ η)    0.412  0.252  0.132  0.058  0.021  0.006  0.001  0.000  0.000  0.000

Table 11.2: Probability of committing a Type I error depending on the value of the threshold in
Example 11.1.4. The values are rounded to three decimal places.


11.2 Parametric testing

In this section we discuss hypothesis testing under the assumption that our data are sampled
from a known distribution with unknown parameters. We again take a frequentist perspective,
as is usually done in most studies in the applied sciences. The parameter is consequently
deterministic and so are the hypotheses: the null hypothesis is true or not; there is no such
thing as the probability that the null hypothesis holds.

To simplify the exposition, we assume that the probability distribution depends only on one
parameter that we denote by $\theta$. $P_\theta$ is the probability measure of our probability space if $\theta$ is
the value of the parameter. $\vec{X}$ is a random vector distributed according to $P_\theta$. The actual data
that we observe, which we denote by $\vec{x}$, is assumed to be a realization of this random vector.

Assume that the null hypothesis is $\theta = \theta_0$. In that case, the size of a test with test statistic T
and rejection region $\mathcal{R}$ is equal to

$$\alpha = P_{\theta_0}\left(T(\vec{X}) \in \mathcal{R}\right). \tag{11.5}$$

For a rejection region of the form (11.1) we have

$$\alpha := P_{\theta_0}\left(T(\vec{X}) \geq \eta\right). \tag{11.6}$$

If the realization of the test statistic is $T(\vec{x}_1, \ldots, \vec{x}_n)$ then the smallest significance level at
which we would reject $H_0$ is

$$p = P_{\theta_0}\left(T(\vec{X}) \geq T(\vec{x})\right), \tag{11.7}$$

which is the p value if we observe $\vec{x}$. The p value can consequently be interpreted as the
probability of observing a result that is at least as extreme as what we observe in the data if the
null hypothesis holds.

A hypothesis of the form $\theta = \theta_0$ is known as a simple hypothesis. If a hypothesis is of the form
$\theta \in S$ for a certain set S then the hypothesis is composite. For a composite null hypothesis
$\theta \in \mathcal{H}_0$ we redefine the size and the p value in the following way,

\begin{align}
\alpha &= \sup_{\theta \in \mathcal{H}_0} P_\theta\left(T(\vec{X}) \geq \eta\right), \tag{11.8} \\
p &= \sup_{\theta \in \mathcal{H}_0} P_\theta\left(T(\vec{X}) \geq T(\vec{x})\right). \tag{11.9}
\end{align}

In order to characterize the power of the test for a certain significance level, we compute the
power function.

Definition 11.2.1 (Power function). Let $P_\theta$ be the probability measure parametrized by $\theta$ and
let $\mathcal{R}$ be the rejection region for a test based on the test statistic $T(\vec{x})$. The power function of the
test is defined as

$$\beta(\theta) := P_\theta\left(T(\vec{X}) \in \mathcal{R}\right). \tag{11.10}$$

Ideally we would like $\beta(\theta) \approx 0$ for $\theta \in \mathcal{H}_0$ and $\beta(\theta) \approx 1$ for $\theta \in \mathcal{H}_1$.

Example 11.2.2 (Coin flip). We are interested in checking whether a coin is biased towards
heads. The null hypothesis is that for each coin flip the probability of obtaining heads is $\theta \leq 1/2$.
Consequently, the alternative hypothesis is $\theta > 1/2$. Let us consider a test statistic equal to the
number of heads observed in a sequence of n iid flips,

$$T(\vec{x}) = \sum_{i=1}^{n} 1_{\vec{x}_i = 1}, \tag{11.11}$$

where $\vec{x}_i$ is one if the ith coin flip is heads and zero otherwise. A natural rejection region is

$$T(\vec{x}) \geq \eta. \tag{11.12}$$

In particular, we consider two possible thresholds

1. $\eta = n$, i.e. we only reject the null hypothesis if all the coin flips are heads,

2. $\eta = 3n/5$, i.e. we reject the null hypothesis if at least three fifths of the coin flips are
heads.

What test should we use if the number of coin flips is 5, 50 or 100? Do the tests have a 5%
significance level? What is the power of the tests for these values of n?

To answer these questions, we compute the power function of the test for both options. If $\eta = n$,

\begin{align}
\beta_1(\theta) &= P_\theta\left(T(\vec{X}) \in \mathcal{R}\right) \tag{11.13} \\
&= \theta^n. \tag{11.14}
\end{align}

If $\eta = 3n/5$,

$$\beta_2(\theta) = \sum_{k=3n/5}^{n} \binom{n}{k} \theta^k (1 - \theta)^{n-k}. \tag{11.15}$$

Figure 11.1 shows the two power functions. If $\eta = n$, then the test has a significance level of
5% for the three values of n. However the power is very low, especially for large n. This makes
sense: even if the coin is pretty biased the probability of n heads is extremely low. If $\eta = 3n/5$,
then for n = 5 the test has a significance level way above 5%, since even if the coin is not biased
the probability of observing 3 heads out of 5 flips is quite high. However for large n the test has
much higher power than the first option. If the bias of the coin is above 0.7 we reject the null
hypothesis with high probability.
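The two power functions (11.14) and (11.15) are easy to evaluate numerically. The sketch below uses illustrative function names and assumes Python 3.8 or later for `math.comb`. Since both power functions are increasing in θ, the size of each test over the composite null hypothesis θ ≤ 1/2 is attained at θ = 1/2, which is the value the loop prints as "size".

```python
from math import comb

def power_all_heads(theta, n):
    """beta_1(theta) = theta^n: reject only if all n flips are heads."""
    return theta ** n

def power_three_fifths(theta, n):
    """beta_2(theta): reject if at least 3n/5 of the n flips are heads."""
    k_min = -(-3 * n // 5)  # integer ceiling of 3n/5
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               for k in range(k_min, n + 1))

for n in (5, 50, 100):
    print(n,
          "size:", power_all_heads(0.5, n), power_three_fifths(0.5, n),
          "power at 0.7:", power_all_heads(0.7, n), power_three_fifths(0.7, n))
```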

△

Figure 11.1: Power functions for the tests described in Example 11.2.2.


A systematic method for building tests under parametric assumptions is to threshold the ratio 
between the likelihood of the data under the null hypothesis and the likelihood of the data under 
the alternative hypothesis. If this ratio is high, the data are compatible with the null hypothesis, 
so it should not be rejected. 

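Definition 11.2.3 (Likelihood-ratio test). Let $\mathcal{L}_{\vec{x}}(\theta)$ denote the likelihood function corresponding
to a data vector $\vec{x}$. $\mathcal{H}_0$ and $\mathcal{H}_1$ are the sets corresponding to the null and alternative
hypotheses respectively. The likelihood ratio is

$$\Lambda(\vec{x}) := \frac{\sup_{\theta \in \mathcal{H}_0} \mathcal{L}_{\vec{x}}(\theta)}{\sup_{\theta \in \mathcal{H}_1} \mathcal{L}_{\vec{x}}(\theta)}. \tag{11.16}$$

A likelihood-ratio test has a rejection region of the form $\{\Lambda(\vec{x}) \leq \eta\}$, for a constant threshold $\eta$.

Example 11.2.4 (Gaussian with known variance). Imagine that you have some data that are
well modeled as iid Gaussian with a known variance $\sigma^2$. The mean is unknown and we are
interested in establishing that it is not equal to a certain value $\mu_0$. What is the corresponding
likelihood-ratio test and how should the threshold be set so that we have a significance level $\alpha$?

First, from Example 9.6.4 the sample mean achieves the maximum of the likelihood function of
a Gaussian,

$$\operatorname{av}(\vec{x}) := \arg\max_{\mu} \mathcal{L}_{\vec{x}}(\mu, \sigma) \tag{11.17}$$

for any value of $\sigma$. Using this result, we have

\begin{align}
\Lambda(\vec{x}) &= \frac{\sup_{\mu = \mu_0} \mathcal{L}_{\vec{x}}(\mu)}{\sup_{\mu \neq \mu_0} \mathcal{L}_{\vec{x}}(\mu)} \tag{11.18} \\
&= \frac{\mathcal{L}_{\vec{x}}(\mu_0)}{\mathcal{L}_{\vec{x}}(\operatorname{av}(\vec{x}))}. \tag{11.19}
\end{align}

Plugging in the expressions for the likelihood we obtain

\begin{align}
\Lambda(\vec{x}) &= \exp\left\{ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( (\vec{x}_i - \operatorname{av}(\vec{x}))^2 - (\vec{x}_i - \mu_0)^2 \right) \right\} \tag{11.20} \\
&= \exp\left\{ \frac{1}{2\sigma^2} \left( -2 \operatorname{av}(\vec{x}) \sum_{i=1}^{n} \vec{x}_i + n \operatorname{av}(\vec{x})^2 + 2 \mu_0 \sum_{i=1}^{n} \vec{x}_i - n \mu_0^2 \right) \right\} \tag{11.21} \\
&= \exp\left\{ -\frac{n \left( \operatorname{av}(\vec{x}) - \mu_0 \right)^2}{2\sigma^2} \right\}. \tag{11.22}
\end{align}

Taking logarithms, the test is of the form

$$\frac{\left| \operatorname{av}(\vec{x}) - \mu_0 \right|}{\sigma} \geq \sqrt{\frac{-2 \log \eta}{n}}. \tag{11.23}$$

The sample mean of n independent Gaussian random variables with mean $\mu_0$ and variance $\sigma^2$
is Gaussian with mean $\mu_0$ and variance $\sigma^2/n$, which implies

\begin{align}
\alpha &= P_{\mu_0}\left( \frac{\left| \operatorname{av}(\vec{X}) - \mu_0 \right|}{\sigma / \sqrt{n}} \geq \sqrt{-2 \log \eta} \right) \tag{11.24} \\
&= 2 \, Q\left( \sqrt{-2 \log \eta} \right). \tag{11.25}
\end{align}

If we fix a desired size $\alpha$ then the test becomes

$$\frac{\left| \operatorname{av}(\vec{x}) - \mu_0 \right|}{\sigma} \geq \frac{Q^{-1}(\alpha/2)}{\sqrt{n}}. \tag{11.26}$$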
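A minimal sketch of the resulting test, assuming Python 3.8 or later so that `statistics.NormalDist` is available. It uses the identity Q^{-1}(α/2) = Φ^{-1}(1 − α/2), where Φ is the standard Gaussian cdf; the function name and the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_lrt_rejects(xs, mu0, sigma, alpha):
    """Return True if the test (11.26) rejects the null hypothesis mu = mu0."""
    n = len(xs)
    av = sum(xs) / n
    q_inv = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # Q^{-1}(alpha / 2)
    return abs(av - mu0) / sigma >= q_inv / sqrt(n)

# Hypothetical data: four observations, known sigma = 1, null value mu0 = 0.
print(gaussian_lrt_rejects([1.0, 1.0, 1.0, 1.0], mu0=0.0, sigma=1.0, alpha=0.05))
print(gaussian_lrt_rejects([0.9, 0.9, 0.9, 0.9], mu0=0.0, sigma=1.0, alpha=0.05))
```

For α = 0.05 the quantile is about 1.96, so with n = 4 and σ = 1 the sample mean has to be at least roughly 0.98 away from μ_0 for the test to reject.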


A motivating argument to employ the likelihood-ratio test is that if the null and alternative 
hypotheses are simple, then it is optimal in terms of power. 

Lemma 11.2.5 (Neyman-Pearson Lemma). If both the null hypothesis and the alternative
hypothesis are simple, i.e. the parameter $\theta$ can only have two values $\theta_0$ and $\theta_1$, then the
likelihood-ratio test has the highest power among all tests with a fixed size.

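Proof. Recall that the power is the probability of rejecting the null hypothesis if it does not
hold. If we denote the rejection region of the likelihood-ratio test by $\mathcal{R}_{\text{LR}}$ then its power is

$$P_{\theta_1}\left(\vec{X} \in \mathcal{R}_{\text{LR}}\right). \tag{11.27}$$

Assume that we have another test with rejection region $\mathcal{R}$. Its power is equal to

$$P_{\theta_1}\left(\vec{X} \in \mathcal{R}\right). \tag{11.28}$$

To prove that the power of the likelihood-ratio test is larger we only need to establish that

$$P_{\theta_1}\left(\vec{X} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}\right) \geq P_{\theta_1}\left(\vec{X} \in \mathcal{R}_{\text{LR}}^c \cap \mathcal{R}\right), \tag{11.29}$$

because the event $\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}$ contributes equally to both powers.

Let us assume that the data are continuous random variables (the argument for discrete random
variables is practically the same) and that the pdfs when the null and alternative hypotheses
hold are $f_{\theta_0}$ and $f_{\theta_1}$ respectively. By the definition of the rejection region of the likelihood-ratio
test, if $\vec{x} \in \mathcal{R}_{\text{LR}}$

$$f_{\theta_1}(\vec{x}) \geq \frac{f_{\theta_0}(\vec{x})}{\eta}, \tag{11.30}$$

whereas if $\vec{x} \in \mathcal{R}_{\text{LR}}^c$

$$f_{\theta_1}(\vec{x}) \leq \frac{f_{\theta_0}(\vec{x})}{\eta}. \tag{11.31}$$

If both tests have size $\alpha$ then

$$P_{\theta_0}\left(\vec{X} \in \mathcal{R}\right) = \alpha = P_{\theta_0}\left(\vec{X} \in \mathcal{R}_{\text{LR}}\right) \tag{11.32}$$

and consequently

\begin{align}
P_{\theta_0}\left(\vec{X} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}\right)
&= P_{\theta_0}\left(\vec{X} \in \mathcal{R}_{\text{LR}}\right) - P_{\theta_0}\left(\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}\right) \tag{11.33} \\
&= P_{\theta_0}\left(\vec{X} \in \mathcal{R}\right) - P_{\theta_0}\left(\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}\right) \tag{11.34} \\
&= P_{\theta_0}\left(\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}^c\right). \tag{11.35}
\end{align}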

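Now let us prove that (11.29) holds,

\begin{align}
P_{\theta_1}\left(\vec{X} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}\right)
&= \int_{\vec{x} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}} f_{\theta_1}(\vec{x}) \, d\vec{x} \tag{11.36} \\
&\geq \frac{1}{\eta} \int_{\vec{x} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}} f_{\theta_0}(\vec{x}) \, d\vec{x} \quad \text{by (11.30)} \tag{11.37} \\
&= \frac{1}{\eta} P_{\theta_0}\left(\vec{X} \in \mathcal{R}^c \cap \mathcal{R}_{\text{LR}}\right) \tag{11.38} \\
&= \frac{1}{\eta} P_{\theta_0}\left(\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}^c\right) \quad \text{by (11.35)} \tag{11.39} \\
&= \frac{1}{\eta} \int_{\vec{x} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}^c} f_{\theta_0}(\vec{x}) \, d\vec{x} \tag{11.40} \\
&\geq \int_{\vec{x} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}^c} f_{\theta_1}(\vec{x}) \, d\vec{x} \quad \text{by (11.31)} \tag{11.41} \\
&= P_{\theta_1}\left(\vec{X} \in \mathcal{R} \cap \mathcal{R}_{\text{LR}}^c\right). \tag{11.42}
\end{align}

□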


11.3 Nonparametric testing: The permutation test 

In practical situations we may not be able to design a parametric model that is adequate for
our data. Nonparametric tests are hypothesis tests that do not assume that the data follow
any distribution with a predefined form. In this section we describe the permutation test, a
nonparametric test that can be used to compare two data sets $\vec{x}_A$ and $\vec{x}_B$ in order to evaluate
conjectures of the form "$\vec{x}_A$ is sampled from a distribution that has a higher mean than $\vec{x}_B$" or
"$\vec{x}_B$ is sampled from a distribution that has a higher variance than $\vec{x}_A$". The null hypothesis is
that the two data sets are actually sampled from the same distribution.

The test statistic in a permutation test is the difference between the values of a test statistic of
interest t evaluated on the two data sets

$$t_{\text{diff}}(\vec{x}) := t(\vec{x}_A) - t(\vec{x}_B), \tag{11.43}$$

where $\vec{x}$ are all the data merged together. Our goal is to test whether $t(\vec{x}_A)$ is larger than $t(\vec{x}_B)$
at a certain significance level. The corresponding rejection region is of the form
$\mathcal{R} := \{ t \mid t \geq \eta \}$. The problem is how to fix the threshold so that the test has the desired
significance level.

Imagine that we randomly permute the labels A and B in the merged data set $\vec{x}$. As a result,
some of the data that were labeled as A will be labeled as B and vice versa. If we recompute
$t_{\text{diff}}(\vec{x})$ we will in general obtain a different value. However, the distribution of the random
variable $T_{\text{diff}}(\vec{X})$ under the hypothesis that the data are sampled from the same distribution
has not changed. Indeed, the null hypothesis implies that the distribution of any function of
$X_1, X_2, \ldots, X_n$ that only depends on the class assigned to each variable is invariant to
permutations. More formally, the random sequence is exchangeable with respect to such functions.

Consider the value of $t_{\text{diff}}$ for all the possible permutations of the labels: $t_{\text{diff},1}, t_{\text{diff},2}, \ldots, t_{\text{diff},n!}$. If
the null hypothesis holds, then it would be surprising to find that $t_{\text{diff}}(\vec{x})$ is larger than most of
the $t_{\text{diff},i}$. In fact, under the null hypothesis, the random variable $T_{\text{diff}}(\vec{X})$ is uniformly distributed
in the set $\{t_{\text{diff},1}, t_{\text{diff},2}, \ldots, t_{\text{diff},n!}\}$, so that

$$P\left(T_{\text{diff}}(\vec{X}) \geq \eta\right) = \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \geq \eta}. \tag{11.44}$$

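This is exactly the size of the test. We can therefore compute the p value of the observed
statistic $t_{\text{diff}}(\vec{x})$ as

\begin{align}
p &= P\left(T_{\text{diff}}(\vec{X}) \geq t_{\text{diff}}(\vec{x})\right) \tag{11.45} \\
&= \frac{1}{n!} \sum_{i=1}^{n!} 1_{t_{\text{diff},i} \geq t_{\text{diff}}(\vec{x})}. \tag{11.46}
\end{align}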

In words, the p value is the fraction of permutations that yield a test statistic at least as
extreme as the one we observe. Unfortunately, it is often challenging to compute (11.46) exactly.
Even for moderately sized data sets the number of possible permutations is usually too large
(for example, $40! > 8 \cdot 10^{47}$) for it to be computationally tractable. In such cases the p value
can be approximated by sampling a large number of permutations and replacing the sum in
(11.46) by the average over the sampled permutations (a Monte Carlo approximation).
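The Monte Carlo approximation can be sketched as follows. This is an illustration rather than the implementation used to produce the results in these notes: the function names, the default number of permutations and the fixed seed are assumptions, and the toy data at the end are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(x_a, x_b, statistic=mean, m=10_000, seed=0):
    """Approximate the p value (11.46) by sampling m random relabelings."""
    rng = random.Random(seed)              # fixed seed so runs are repeatable
    observed = statistic(x_a) - statistic(x_b)
    merged = list(x_a) + list(x_b)
    n_a = len(x_a)
    count = 0
    for _ in range(m):
        rng.shuffle(merged)                # random permutation of the labels
        t_diff = statistic(merged[:n_a]) - statistic(merged[n_a:])
        if t_diff >= observed:
            count += 1
    return count / m

# Hypothetical toy data: the first group is clearly shifted upwards.
print(permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4], m=2000))
```

Shuffling the merged list and splitting it at the original group size is equivalent to permuting the labels A and B while keeping the number of points in each group fixed.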

Before looking at an example, let us review the steps to be followed when applying a permutation 
test. 


1. Choose a conjecture as to how $\vec{x}_A$ and $\vec{x}_B$ are different.

2. Choose a test statistic $t_{\text{diff}}$.

3. Compute $t_{\text{diff}}(\vec{x})$.

4. Permute the labels m times and compute the corresponding values of $t_{\text{diff}}$: $t_{\text{diff},1}, t_{\text{diff},2},
\ldots, t_{\text{diff},m}$.

5. Compute the approximate p value

\begin{align}
p &= P\left(T_{\text{diff}}(\vec{X}) \geq t_{\text{diff}}(\vec{x})\right) \tag{11.47} \\
&\approx \frac{1}{m} \sum_{i=1}^{m} 1_{t_{\text{diff},i} \geq t_{\text{diff}}(\vec{x})} \tag{11.48}
\end{align}

and reject the null hypothesis if it is below a predefined limit (typically 1% or 5%).

Figure 11.2: Histograms of the cholesterol and blood pressure for men and women in Example 11.3.1.

Example 11.3.1 (Cholesterol and blood pressure). A scientist wants to determine whether men
have higher cholesterol and blood pressure. She gathers data from 86 men and 182 women.
Figure 11.2 shows the histograms of the cholesterol and blood pressure for men and women.
From the histograms it seems that men have higher levels of cholesterol and blood pressure.
The sample mean for cholesterol is 261.3 mg/dl amongst men and 242.0 mg/dl amongst women.
The sample mean for blood pressure is 133.2 mmHg amongst men and 130.6 mmHg amongst
women.

In order to quantify whether these differences are significant we compute the sample permutation
distribution of the difference between the sample means using $10^6$ permutations. To make sure
that the results are stable, we repeat the procedure three times. The results are shown in
Figure 11.3. For cholesterol, the p value is around 0.1%, so we have very strong evidence against
the null hypothesis. In contrast, the p value for blood pressure is around 13%, so the results are
not conclusive: we cannot reject the possibility that the difference is merely due to random
fluctuations.

[Figure 11.3 panels, one row per repetition: cholesterol (left), with p values 0.119%, 0.112% and
0.115%; blood pressure (right), with p values 13.48%, 13.56% and 13.50%.]

Figure 11.3: Approximate distribution under the null hypothesis of the difference between the sample
means of cholesterol and blood pressure in men and women. The observed value for the test statistic is
marked by a dashed line.

△


11.4 Multiple testing 

In some applications, it is common to conduct many simultaneous hypothesis tests. For example,
in computational genomics a researcher might be interested in testing whether any gene within
a group of several thousand is relevant to a certain disease. If we apply a hypothesis test with
size $\alpha$ in this setting, then the probability of obtaining a false positive for a particular gene is
$\alpha$. Now, assume that we test n genes and that the events "gene i is a false positive", $1 \leq i \leq n$,
are all mutually independent. The probability of obtaining at least one false positive is

\begin{align}
P(\text{at least one false positive}) &= 1 - P(\text{no false positives}) \tag{11.49} \\
&= 1 - (1 - \alpha)^n. \tag{11.50}
\end{align}

For $\alpha = 0.01$ and n = 500 this probability is equal to 0.99! If we want to control the probability
of making a Type I error we must take into account that we are carrying out multiple tests at
the same time. A popular procedure to do this is Bonferroni’s method.
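The figure quoted above follows directly from (11.50). The short sketch below evaluates it and, for comparison, shows the effect of running each test at level α/n, which is the correction that Bonferroni's method applies; the function name is illustrative.

```python
def prob_at_least_one_false_positive(alpha, n):
    """Equation (11.50): n independent tests, each of size alpha."""
    return 1.0 - (1.0 - alpha) ** n

# Uncorrected: each of the 500 tests has size 0.01.
print(prob_at_least_one_false_positive(0.01, 500))

# Each test run at level alpha / n instead: the overall probability
# stays below the original alpha.
print(prob_at_least_one_false_positive(0.01 / 500, 500))
```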

Definition 11.4.1 (Bonferroni’s method). Given n hypothesis tests, compute the corresponding
p values $p_1, \ldots, p_n$. For a fixed significance level $\alpha$ reject the ith null hypothesis if

$$p_i \leq \frac{\alpha}{n}. \tag{11.51}$$

The following lemma shows that the method guarantees that the desired significance level holds
simultaneously for all the tests.

Lemma 11.4.2. If we apply Bonferroni’s method, the probability of making a Type I error is
bounded by $\alpha$.

Proof. The result follows directly from the union bound, which controls the probability of a
union of events with the sum of their individual probabilities.

Theorem 11.4.3 (Union bound). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $S_1, S_2, \ldots$ a collection
of events in $\mathcal{F}$. Then

$$P\left(\cup_i S_i\right) \leq \sum_i P(S_i). \tag{11.52}$$

Proof. Let us define the sets:

$$\tilde{S}_i := S_i \cap \left( \cap_{j=1}^{i-1} S_j^c \right). \tag{11.53}$$

It is straightforward to show by induction that $\cup_{j=1}^{n} S_j = \cup_{j=1}^{n} \tilde{S}_j$ for any n, so $\cup_i S_i = \cup_i \tilde{S}_i$. The
sets $\tilde{S}_1, \tilde{S}_2, \ldots$ are disjoint by construction, so

\begin{align}
P\left(\cup_i S_i\right) = P\left(\cup_i \tilde{S}_i\right) &= \sum_i P(\tilde{S}_i) \quad \text{by Axiom 2 in Definition 1.1.4} \tag{11.54} \\
&\leq \sum_i P(S_i) \quad \text{because } \tilde{S}_i \subseteq S_i. \tag{11.55}
\end{align}

Applying the bound,

\begin{align}
P(\text{Type I error}) &= P\left(\cup_{i=1}^{n} \text{Type I error for test } i\right) \tag{11.56} \\
&\leq \sum_{i=1}^{n} P(\text{Type I error for test } i) \quad \text{by the union bound} \tag{11.57} \\
&= n \cdot \frac{\alpha}{n} = \alpha. \tag{11.58}
\end{align}

□


Example 11.4.4 (Clutch (continued)). If we apply the test in Example 11.1.4 to 10 players,
the probability that one of them seems to be clutch just due to chance increases substantially.
To control for this, by Bonferroni’s method we must divide the significance level of the individual
tests by 10. As a result, to maintain a significance level of 0.05 we would require that each player
score more points per minute during the last quarter in at least 17 of the 20 games instead of 15
(see Table 11.2) in order to reject the null hypothesis.
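The thresholds in this example can be checked by searching for the smallest η whose size, given by (11.4), does not exceed the required level. The sketch below assumes Python 3.8 or later for `math.comb`; the function names are illustrative.

```python
from math import comb

def size_of_test(n, eta):
    """P(T_0 >= eta) for T_0 binomial with parameters n and 1/2."""
    return sum(comb(n, k) for k in range(eta, n + 1)) / 2 ** n

def smallest_threshold(n, level):
    """Smallest eta such that the test {t >= eta} has size at most level."""
    for eta in range(n + 1):
        if size_of_test(n, eta) <= level:
            return eta
    return None  # no threshold achieves the requested level

print(smallest_threshold(20, 0.05))       # a single player
print(smallest_threshold(20, 0.05 / 10))  # Bonferroni correction for 10 players
```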

△


Chapter 12 

Linear Regression 


In statistics, regression is the problem of characterizing the relation between a certain quantity
of interest y, called the response or the dependent variable, and several observed variables
$x_1, x_2, \ldots, x_p$, known as covariates, features or independent variables. For example, the
response could be the price of a house and the covariates could correspond to its area, the
number of rooms, the year it was built, etc. A regression model would describe how house prices
are affected by all of these factors.

More formally, the main assumption in regression models is that the response is generated
according to a function h applied to the features and then perturbed by some unknown noise z,
which is often additive,

$$y = h(\vec{x}) + z. \tag{12.1}$$

The aim is to learn h from n examples of responses and their corresponding features

$$\left(y^{(1)}, \vec{x}^{(1)}\right), \left(y^{(2)}, \vec{x}^{(2)}\right), \ldots, \left(y^{(n)}, \vec{x}^{(n)}\right). \tag{12.2}$$

In this chapter we focus on the case where h is a linear function.


12.1 Linear models 


If the regression function h in a model of the form (12.1) is linear, then the response is modeled
as a linear combination of the features:

$$y^{(i)} = \vec{x}^{(i)\,T} \vec{\beta}^* + z^{(i)}, \quad 1 \leq i \leq n, \tag{12.3}$$

where $z^{(i)}$ is an entry of the unknown noise vector. The function is parametrized by a vector of
weights $\vec{\beta}^* \in \mathbb{R}^p$. All we need to fit the linear model to the data is to estimate these weights.

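Expressing the linear system (12.3) in matrix form yields the following representation of the
linear-regression model

$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
= \begin{bmatrix}
\vec{x}_1^{(1)} & \vec{x}_2^{(1)} & \cdots & \vec{x}_p^{(1)} \\
\vec{x}_1^{(2)} & \vec{x}_2^{(2)} & \cdots & \vec{x}_p^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
\vec{x}_1^{(n)} & \vec{x}_2^{(n)} & \cdots & \vec{x}_p^{(n)}
\end{bmatrix}
\begin{bmatrix} \vec{\beta}_1^* \\ \vec{\beta}_2^* \\ \vdots \\ \vec{\beta}_p^* \end{bmatrix}
+ \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}. \tag{12.4}$$

Equivalently,

$$\vec{y} = X \vec{\beta}^* + \vec{z}, \tag{12.5}$$

where X is an $n \times p$ matrix containing the features, $\vec{y}$ contains the response and $\vec{z} \in \mathbb{R}^n$ represents
the noise.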

Example 12.1.1 (Linear model for GDP). We consider the problem of building a linear model
to predict the gross domestic product (GDP) of a state in the US from its population and
unemployment rate. We have available the following data:

                     GDP               Population     Unemployment
                     (USD millions)                   rate (%)

    North Dakota          52 089          757 952         2.4
    Alabama              204 861        4 863 300         3.8
    Mississippi          107 680        2 988 726         5.2
    Arkansas             120 689        2 988 248         3.5
    Kansas               153 258        2 907 289         3.8
    Georgia              525 360       10 310 371         4.5
    Iowa                 178 766        3 134 693         3.2
    West Virginia         73 374        1 831 102         5.1
    Kentucky             197 043        4 436 974         5.2
    Tennessee                ???        6 651 194         3.0


In this example, the GDP is the response, and the population and the unemployment rate are
the features. Our goal is to fit a linear model to the data so that we can predict the GDP of
Tennessee. We begin by centering and normalizing the data. The averages
of the response and of the features are

$$\operatorname{av}(\vec{y}) = 179\,236, \qquad \operatorname{av}(X) = \begin{bmatrix} 3\,802\,073 & 4.1 \end{bmatrix}. \tag{12.6}$$

The empirical standard deviations are

$$\operatorname{std}(\vec{y}) = 396\,701, \qquad \operatorname{std}(X) = \begin{bmatrix} 7\,720\,656 & 2.80 \end{bmatrix}. \tag{12.7}$$

We subtract the average and divide by the standard deviations so that both the response and
the features are centered and on the same scale,

\[
y = \begin{bmatrix} -0.321 \\ 0.065 \\ -0.180 \\ -0.148 \\ -0.065 \\ 0.872 \\ -0.001 \\ -0.267 \\ 0.045 \end{bmatrix},
\qquad
X = \begin{bmatrix}
-0.394 & -0.600 \\
0.137 & -0.099 \\
-0.105 & 0.401 \\
-0.105 & -0.207 \\
-0.116 & -0.099 \\
0.843 & 0.151 \\
-0.086 & -0.314 \\
-0.255 & 0.366 \\
0.082 & 0.401
\end{bmatrix}.
\tag{12.8}
\]

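To obtain the estimate for the GDP of Tennessee we fit the model

\[ y \approx X \beta, \tag{12.9} \]

rescale according to the standard deviations (12.7) and recenter using the averages (12.6). The
final estimate is

\[ y^{\text{Ten}} = \operatorname{av}(y) + \operatorname{std}(y) \left\langle x^{\text{Ten}}_{\text{norm}}, \beta_{\text{LS}} \right\rangle, \tag{12.10} \]

where $x^{\text{Ten}}_{\text{norm}}$ is centered using $\operatorname{av}(X)$ and normalized using $\operatorname{std}(X)$, and $\beta_{\text{LS}}$ is the weight vector obtained from the fit.

△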
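The centering and normalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the numbers above. It assumes that "std" in (12.7) is the $\ell_2$ norm of the centered column, a reading that is consistent with the reported values; the helper name center_and_scale is hypothetical.

```python
import numpy as np

# Data from the table in Example 12.1.1 (Tennessee is held out).
gdp = np.array([52089, 204861, 107680, 120689, 153258,
                525360, 178766, 73374, 197043], dtype=float)
features = np.array([
    [757952, 2.4], [4863300, 3.8], [2988726, 5.2],
    [2988248, 3.5], [2907289, 3.8], [10310371, 4.5],
    [3134693, 3.2], [1831102, 5.1], [4436974, 5.2],
])

def center_and_scale(a):
    """Subtract the column average, then divide by the l2 norm of the
    centered column (our assumed reading of "std" in (12.7))."""
    av = a.mean(axis=0)
    scale = np.linalg.norm(a - av, axis=0)
    return (a - av) / scale, av, scale

y, av_y, std_y = center_and_scale(gdp)
X, av_X, std_X = center_and_scale(features)

print("av(y) =", round(av_y), " std(y) =", round(std_y))
print("first row of normalized X:", np.round(X[0], 3))
```

With this convention each normalized column has unit $\ell_2$ norm, so the entries are much smaller than they would be after dividing by the usual sample standard deviation.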


12.2 Least-squares estimation 


To calibrate the linear regression model, we need to estimate the weight vector so that it yields
a good fit to the data. We can evaluate the fit for a specific choice of $\beta \in \mathbb{R}^p$ using the sum of
the squares of the error,

\[ \sum_{i=1}^{n} \Big( y^{(i)} - \sum_{j=1}^{p} \beta_j x_j^{(i)} \Big)^2 = \| y - X\beta \|_2^2. \tag{12.11} \]

The least-squares estimate $\beta_{\text{LS}}$ is the vector of weights that minimizes this cost function,

\[ \beta_{\text{LS}} := \arg\min_{\beta} \| y - X\beta \|_2^2. \tag{12.12} \]


The least-squares cost function is convenient from a computational point of view, since it is convex and
can be minimized efficiently (in fact, as we will see in a moment, it has a closed-form solution).
In addition, it has intuitive geometric and probabilistic interpretations. Figure 12.1 shows the
linear model learnt using least squares in a simple example where there is just one feature (p = 1)
and 40 examples (n = 40).

Figure 12.1: Linear model learnt via least-squares fitting for a simple example where there is just one
feature (p = 1) and 40 examples (n = 40).


Example 12.2.1 (Linear model for GDP (continued)). The least-squares estimate for the
regression coefficients in the linear GDP model is equal to

\[ \beta_{\text{LS}} = \begin{bmatrix} 1.019 \\ -0.111 \end{bmatrix}. \tag{12.13} \]


The positive first weight and the negative second weight suggest that the GDP grows with the
population and decreases as the unemployment rate increases. We now compare the fit provided by
the linear model to the original data, as well as its prediction of the GDP of Tennessee:

                  GDP        Estimate
North Dakota       52 089     46 241
Alabama           204 861    239 165
Mississippi       107 680    119 005
Arkansas          120 689    145 712
Kansas            153 258    136 756
Georgia           525 360    513 343
Iowa              178 766    158 097
West Virginia      73 374     59 969
Kentucky          197 043    194 829
Tennessee         328 770    345 352

△

Figure 12.2: Illustration of Corollary 12.2.3. The least-squares solution is a projection of the data onto
the subspace spanned by the columns of $X$, denoted by $X_1$ and $X_2$.
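The fit and the prediction for Tennessee can be reproduced approximately with NumPy. The sketch below is an illustration under one assumption: "std" is taken to be the $\ell_2$ norm of the centered column, which matches the normalized values reported in Example 12.1.1. The function predict_gdp is a hypothetical helper, not part of the notes.

```python
import numpy as np

# Columns: population, unemployment rate (%); data from Example 12.1.1.
features = np.array([
    [757952, 2.4], [4863300, 3.8], [2988726, 5.2],
    [2988248, 3.5], [2907289, 3.8], [10310371, 4.5],
    [3134693, 3.2], [1831102, 5.1], [4436974, 5.2],
])
gdp = np.array([52089, 204861, 107680, 120689, 153258,
                525360, 178766, 73374, 197043], dtype=float)
tennessee = np.array([6651194, 3.0])

# Center and scale (scale = l2 norm of the centered column, an assumption).
av_X = features.mean(axis=0)
std_X = np.linalg.norm(features - av_X, axis=0)
av_y = gdp.mean()
std_y = np.linalg.norm(gdp - av_y)
X = (features - av_X) / std_X
y = (gdp - av_y) / std_y

# Least-squares weights for the normalized data.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_gdp(raw_features):
    """Normalize the features, apply the weights and undo the scaling, as in (12.10)."""
    return av_y + std_y * (((raw_features - av_X) / std_X) @ beta_ls)

fitted = predict_gdp(features)
tennessee_estimate = predict_gdp(tennessee)
print("weights:", np.round(beta_ls, 3))
print("Tennessee estimate:", round(tennessee_estimate))
```

Because the features are normalized with the training averages and scales, the held-out state is mapped into the same coordinates as the training data before the weights are applied.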


12.2.1 Geometric interpretation 

The following theorem, proved in Section 12.5.1, shows that the least-squares problem has a
closed-form solution.

Theorem 12.2.2 (Least-squares solution). For $n \geq p$, if $X$ is full rank then the solution to the
least-squares problem (12.12) is

\[ \beta_{\text{LS}} := (X^T X)^{-1} X^T y. \tag{12.14} \]

A corollary to this result provides a geometric interpretation for the least-squares estimate of
$y$: it is obtained by projecting the response onto the column space of the matrix formed by the
predictors.

Corollary 12.2.3. For $n \geq p$, if $X$ is full rank then $X \beta_{\text{LS}}$ is the projection of $y$ onto the column
space of $X$.

We provide a formal proof in Section 12.5.2, but the result is very intuitive.
Any vector of the form $X\beta$ is in the span of the columns of $X$. By definition, the least-squares
estimate is the closest vector to $y$ that can be represented in this way, so it is the projection of
$y$ onto the column space of $X$. This is illustrated in Figure 12.2.
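The closed-form solution and its geometric interpretation can be checked numerically. The following sketch uses randomly generated data (the sizes and the seed are arbitrary choices made for illustration) and compares the closed form with NumPy's general least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3                      # n >= p, so X is full rank with probability one
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed-form solution (12.14), computed by solving the normal equations
# X^T X beta = X^T y instead of forming the inverse explicitly.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# If X beta_LS is the projection of y onto the column space of X,
# the residual must be orthogonal to every column of X.
residual = y - X @ beta_closed
print(np.allclose(beta_closed, beta_lstsq))
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))
```

Solving the normal equations is fine for a small well-conditioned example; for ill-conditioned problems a QR- or SVD-based solver such as np.linalg.lstsq is usually preferred.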

12.2.2 Probabilistic interpretation 

If we model the noise in (12.5) as a realization from a random vector $Z$ which has entries that are
independent Gaussian random variables with mean zero and a certain variance $\sigma^2$, then we can
interpret the least-squares estimate as a maximum-likelihood estimate. Under that assumption,
the data are a realization of the random vector

\[ Y := X\beta + Z, \tag{12.15} \]

which is a Gaussian random vector with independent entries, mean $X\beta$ and covariance matrix $\sigma^2 I$. The joint
pdf of $Y$ is equal to

\begin{align}
f_Y(a) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \big( a_i - (X\beta)_i \big)^2 \right) \tag{12.16} \\
&= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \| a - X\beta \|_2^2 \right). \tag{12.17}
\end{align}


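The likelihood is the probability density function of $Y$ evaluated at the observed data $y$ and
interpreted as a function of the weight vector $\beta$,

\[ \mathcal{L}_y(\beta) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right). \tag{12.18} \]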


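To find the ML estimate, we maximize the log likelihood. We conclude that it is given by the
solution to the least-squares problem, since

\begin{align}
\beta_{\text{ML}} &= \arg\max_{\beta} \mathcal{L}_y(\beta) \tag{12.19} \\
&= \arg\max_{\beta} \log \mathcal{L}_y(\beta) \tag{12.20} \\
&= \arg\min_{\beta} \| y - X\beta \|_2^2 \tag{12.21} \\
&= \beta_{\text{LS}}. \tag{12.22}
\end{align}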
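The step from (12.20) to (12.21) can be made explicit by taking the logarithm of the likelihood (12.18). The short derivation below uses only that expression and the Gaussian noise assumption stated above.

```latex
\begin{align*}
\log \mathcal{L}_y(\beta)
  &= \log \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n}
     - \frac{1}{2\sigma^2}\,\| y - X\beta \|_2^2 \\
  &= -\frac{n}{2}\log(2\pi) - n \log \sigma
     - \frac{1}{2\sigma^2}\,\| y - X\beta \|_2^2 .
\end{align*}
```

The first two terms do not depend on $\beta$ and the factor $1/(2\sigma^2)$ is positive, so maximizing the log likelihood over $\beta$ is the same as minimizing $\| y - X\beta \|_2^2$. In particular, the maximizer does not depend on the value of $\sigma$.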


12.3 Overfitting 

Imagine that a friend tells you: 

I found a cool way to predict the temperature in New York: It’s just a linear combination of the 
temperature in every other state. I fit the model on data from the last month and a half and it’s 
perfect! 

Your friend is not lying, but the problem is that she is using a number of data points to fit the
linear model that is roughly the same as the number of parameters. If $n \leq p$ (and the rows of $X$
are linearly independent) we can find a $\beta$ such that $y = X\beta$ exactly, even if $y$ and $X$ have
nothing to do with each other! This is called overfitting and is usually caused by using a model
that is too flexible with respect to the amount of data that is available.

To evaluate whether a model suffers from overfitting we separate the data into a training set 
and a test set. The training set is used to fit the model and the test set is used to evaluate the 
error. A model that overfits the training set will have very low error when evaluated on the 
training examples, but will not generalize well to the test examples.

Figure 12.3 shows the result of evaluating the training error and the test error of a linear model
with p = 50 parameters fitted from n training examples. The training and test data are generated
by fixing a vector of weights $\beta^*$ and then computing

\begin{align}
y_{\text{train}} &= X_{\text{train}}\, \beta^* + z_{\text{train}}, \tag{12.23} \\
y_{\text{test}} &= X_{\text{test}}\, \beta^*, \tag{12.24}
\end{align}

where the entries of $X_{\text{train}}$, $X_{\text{test}}$, $z_{\text{train}}$ and $\beta^*$ are sampled independently at random from
a Gaussian distribution with zero mean and unit variance. The training and test errors are
defined as

\begin{align}
\text{error}_{\text{train}} &= \frac{\| X_{\text{train}}\, \beta_{\text{LS}} - y_{\text{train}} \|_2}{\| y_{\text{train}} \|_2}, \tag{12.25} \\
\text{error}_{\text{test}} &= \frac{\| X_{\text{test}}\, \beta_{\text{LS}} - y_{\text{test}} \|_2}{\| y_{\text{test}} \|_2}. \tag{12.26}
\end{align}


Note that even the true $\beta^*$ does not achieve zero training error because of the presence of the
noise, but the test error is actually zero if we manage to estimate $\beta^*$ exactly.

The training error of the linear model grows with n. This makes sense as the model has to fit
more data using the same number of parameters. When n is close to p = 50, the fitted model
is much better than the true model at replicating the training data (the error of the true model
is shown in green). This is a sign of overfitting: the model is adapting to the noise and not
learning the true linear structure. Indeed, in that regime the test error is extremely high. At
larger n, the training error rises to the level achieved by the true linear model and the test error
decreases, indicating that we are learning the underlying model.
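The experiment described above can be simulated in a few lines. The sketch below follows the data-generating process of (12.23) and (12.24) with p = 50; the number of test examples (n_test), the seed and the particular values of n are assumptions made for illustration, since the notes do not specify them.

```python
import numpy as np

def train_and_test_error(n, p=50, n_test=1000, seed=0):
    """Relative l2 errors (12.25)-(12.26) for one random draw of the data."""
    rng = np.random.default_rng(seed)
    beta_true = rng.standard_normal(p)
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n_test, p))
    y_train = X_train @ beta_true + rng.standard_normal(n)   # noisy responses
    y_test = X_test @ beta_true                              # noiseless responses
    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
    return err_train, err_test

for n in (50, 60, 100, 500, 2000):
    err_train, err_test = train_and_test_error(n)
    print(f"n = {n:4d}  train error = {err_train:.3f}  test error = {err_test:.3f}")
```

When n = p the matrix of training features is square, so the fit reproduces the noisy training data essentially exactly, which is the overfitting regime described in the text.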


12.4 Global warming 

In this section we describe an application of linear regression to climate data. In particular, we
analyze temperature data taken in a weather station in Oxford over 150 years.¹ Our objective is
not to perform prediction, but rather to determine whether temperatures have risen or fallen
during the last 150 years in Oxford.

In order to separate the temperature into different components that account for seasonal effects 
we use a simple linear model with three predictors and an intercept 

\[ y_t \approx \beta_0 + \beta_1 \cos\left( \frac{2\pi t}{12} \right) + \beta_2 \sin\left( \frac{2\pi t}{12} \right) + \beta_3\, t, \tag{12.27} \]

where $1 \leq t \leq n$ denotes the time in months ($n$ equals 12 times 150). The corresponding matrix
of predictors is

\[
X := \begin{bmatrix}
1 & \cos\left( \frac{2\pi t_1}{12} \right) & \sin\left( \frac{2\pi t_1}{12} \right) & t_1 \\
1 & \cos\left( \frac{2\pi t_2}{12} \right) & \sin\left( \frac{2\pi t_2}{12} \right) & t_2 \\
\vdots & \vdots & \vdots & \vdots \\
1 & \cos\left( \frac{2\pi t_n}{12} \right) & \sin\left( \frac{2\pi t_n}{12} \right) & t_n
\end{bmatrix}.
\tag{12.28}
\]

¹ The data are available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.

Figure 12.3: Relative $\ell_2$-norm error in estimating the response achieved using least-squares regression
for different values of n (the number of training data). The training error is plotted in blue, whereas the
test error is plotted in red. The green line indicates the training error of the true model used to generate
the data.


The intercept $\beta_0$ represents the mean temperature, $\beta_1$ and $\beta_2$ account for periodic yearly
fluctuations and $\beta_3$ is the overall trend. If $\beta_3$ is positive then the model indicates that temperatures
are increasing; if it is negative then it indicates that temperatures are decreasing.

The results of fitting the linear model are shown in Figures 12.4 and 12.5. The fitted model
indicates that both the maximum and minimum temperatures have an increasing trend of about
0.8 degrees Celsius (around 1.4 degrees Fahrenheit) per 100 years.
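The structure of this model can be illustrated on synthetic data. The sketch below does not use the Oxford measurements: the weights in beta_true and the noise level are made-up values chosen only so that the example runs on its own, with a trend of 0.8 degrees per 100 years planted in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12 * 150                         # number of monthly measurements
t = np.arange(1, n + 1)

# Matrix of predictors: intercept, yearly cosine and sine, linear trend.
X = np.column_stack([
    np.ones(n),
    np.cos(2 * np.pi * t / 12),
    np.sin(2 * np.pi * t / 12),
    t,
])

# Synthetic temperatures (NOT the Oxford data); all weights are hypothetical.
beta_true = np.array([10.0, -6.0, -2.0, 0.8 / 1200])   # trend of 0.8 C per 1200 months
y = X @ beta_true + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
trend_per_century = beta_hat[3] * 1200                 # convert per-month slope to per 100 years
print(f"estimated trend: {trend_per_century:.2f} C / 100 years")
```

Since t is measured in months, the fitted slope is multiplied by 1200 to express the trend per 100 years, the unit used in Figure 12.5.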

12.5 Proofs 

12.5.1 Proof of Theorem 12.2.2

Let $X = U \Sigma V^T$ be the singular-value decomposition (SVD) of $X$. Under the conditions of the
theorem, $(X^T X)^{-1} X^T y = V \Sigma^{-1} U^T y$. We begin by separating $y$ into two components

\[ y = U U^T y + (I - U U^T)\, y, \tag{12.29} \]













[Figure: two panels, "Maximum temperature" and "Minimum temperature"; vertical axis: temperature (Celsius); horizontal axis: year, 1860 to 2000; legend: Data, Model.]

Figure 12.4: Temperature data together with the linear model described by (12.27) for both maximum
and minimum temperatures.

[Figure: two panels, "Maximum temperature" and "Minimum temperature"; trend labels: +0.75 °C / 100 years, +0.88 °C / 100 years.]

Figure 12.5: Temperature trend obtained by fitting the model described by (12.27) for both maximum
and minimum temperatures.


where $U U^T y$ is the projection of $y$ onto the column space of $X$. Note that $(I - U U^T)\, y$ is
orthogonal to the column space of $X$ and consequently to both $U U^T y$ and $X\beta$ for any $\beta$. By
Pythagoras's Theorem

\[ \| y - X\beta \|_2^2 = \| (I - U U^T)\, y \|_2^2 + \| U U^T y - X\beta \|_2^2. \tag{12.30} \]

The minimum value of this cost function that can be achieved by optimizing over $\beta$ is $\| (I - U U^T)\, y \|_2^2$.
This can be achieved by solving the system of equations

\[ U U^T y = X\beta = U \Sigma V^T \beta. \tag{12.31} \]

Since $U^T U = I$ because $n \geq p$, multiplying both sides of the equality by $U^T$ yields the equivalent
system

\[ U^T y = \Sigma V^T \beta. \tag{12.32} \]

Since $X$ is full rank, $\Sigma$ and $V$ are square and invertible (and by definition of the SVD $V^{-1} = V^T$),
so

\[ \beta_{\text{LS}} = V \Sigma^{-1} U^T y \tag{12.33} \]

is the unique solution to the system and consequently also of the least-squares problem.

12.5.2 Proof of Corollary 12.2.3

Let $X = U \Sigma V^T$ be the singular-value decomposition of $X$. Since $X$ is full rank and $n \geq p$ we
have $U^T U = I$, $V^T V = V V^T = I$ and $\Sigma$ is a square invertible matrix, which implies

\begin{align}
X \beta_{\text{LS}} &= X (X^T X)^{-1} X^T y \tag{12.34} \\
&= U \Sigma V^T \left( V \Sigma U^T U \Sigma V^T \right)^{-1} V \Sigma U^T y \tag{12.35} \\
&= U U^T y. \tag{12.36}
\end{align}
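Both proofs can be checked numerically with the reduced SVD returned by NumPy. The sketch below uses random data with n > p (the sizes and the seed are arbitrary); it compares the expression in (12.33) with a general least-squares solver and verifies that the fitted values equal $U U^T y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Reduced SVD: U is n x p, s holds the p singular values, Vt is the p x p matrix V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

beta_svd = Vt.T @ ((U.T @ y) / s)           # V Sigma^{-1} U^T y
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

projection = U @ (U.T @ y)                  # U U^T y
print(np.allclose(beta_svd, beta_ref))
print(np.allclose(X @ beta_svd, projection))
```

Dividing by the vector s applies $\Sigma^{-1}$ without forming the diagonal matrix, which is valid because $\Sigma$ is diagonal with nonzero entries when $X$ is full rank.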


Appendix A 

Set theory 


This chapter provides a review of basic concepts in set theory. 

A.1 Basic definitions

A set is a collection of objects. The set containing every possible object that we consider in a
certain situation is called the universe and is usually denoted by $\Omega$. If an object $x$ in $\Omega$ belongs
to set $S$, we say that $x$ is an element of $S$ and write $x \in S$. If $x$ is not an element of $S$ then
we write $x \notin S$. The empty set, usually denoted by $\emptyset$, is a set such that $x \notin \emptyset$ for all $x \in \Omega$ (i.e.
it has no elements). If all the elements in a set $B$ also belong to a set $A$ then $B$ is a subset of
$A$, which we denote by $B \subseteq A$. If in addition there is at least one element of $A$ that does not
belong to $B$ then $B$ is a proper subset of $A$, denoted by $B \subset A$.

The elements of a set can be arbitrary objects and in particular they can be sets themselves. 
This is the case for the power set of a set, defined in the next section. 

A useful way of defining a set is through a statement concerning its elements. If $S$ is the set
of elements for which a certain statement $s(x)$ holds, we define $S$ by writing

\[ S := \{ x \mid s(x) \}. \tag{A.1} \]

For example, $A := \{ x \mid 1 < x < 3 \}$ is the set of all elements greater than 1 and smaller than 3.
Let us define some important sets and set operations using this notation.

A.2 Basic operations 

Definition A.2.1 (Set operations).

• The complement $S^c$ of a set $S$ contains all elements that are not in $S$.

\[ S^c := \{ x \mid x \notin S \}. \tag{A.2} \]

• The union of two sets $A$ and $B$ contains the objects that belong to $A$ or $B$.

\[ A \cup B := \{ x \mid x \in A \text{ or } x \in B \}. \tag{A.3} \]

This can be generalized to a sequence of sets $A_1, A_2, \ldots$

\[ \bigcup_n A_n := \{ x \mid x \in A_n \text{ for some } n \}, \tag{A.4} \]

where the sequence may be infinite.

• The intersection of two sets $A$ and $B$ contains the objects that belong to $A$ and $B$.

\[ A \cap B := \{ x \mid x \in A \text{ and } x \in B \}. \tag{A.5} \]

Again, this can be generalized to a sequence,

\[ \bigcap_n A_n := \{ x \mid x \in A_n \text{ for all } n \}. \tag{A.6} \]

• The difference of two sets $A$ and $B$ contains the elements in $A$ that are not in $B$.

\[ A / B := \{ x \mid x \in A \text{ and } x \notin B \}. \tag{A.7} \]

• The power set $2^S$ of a set $S$ is the set of all possible subsets of $S$, including $\emptyset$ and $S$.

\[ 2^S := \{ S' \mid S' \subseteq S \}. \tag{A.8} \]

• The cartesian product of two sets $S_1$ and $S_2$ is the set of all ordered pairs of elements
in the sets

\[ S_1 \times S_2 := \{ (x_1, x_2) \mid x_1 \in S_1, \; x_2 \in S_2 \}. \tag{A.9} \]

An example is $\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$, the set of all possible pairs of real numbers.

Two sets are equal if they have the same elements, i.e. $A = B$ if and only if $A \subseteq B$ and $B \subseteq A$.
It is easy to verify for instance that $(A^c)^c = A$, $S \cup \Omega = \Omega$, $S \cap \Omega = S$ or the following identities
which are known as De Morgan's laws.

Theorem A.2.2 (De Morgan's laws). For any two sets $A$ and $B$

\begin{align}
(A \cup B)^c &= A^c \cap B^c, \tag{A.10} \\
(A \cap B)^c &= A^c \cup B^c. \tag{A.11}
\end{align}

Proof. Let us prove the first identity; the proof of the second is almost identical.

First we prove that $(A \cup B)^c \subseteq A^c \cap B^c$. A standard way to prove the inclusion of a set in another
set is to show that if an element belongs to the first set then it must also belong to the second.
Any element $x$ in $(A \cup B)^c$ (if the set is empty then the inclusion holds trivially, since $\emptyset \subseteq S$ for
any set $S$) is in $A^c$; otherwise it would belong to $A$ and consequently to $A \cup B$. Similarly, $x$ also
belongs to $B^c$. We conclude that $x$ belongs to $A^c \cap B^c$, which proves the inclusion.

To complete the proof we establish $A^c \cap B^c \subseteq (A \cup B)^c$. If $x \in A^c \cap B^c$, then $x \notin A$ and $x \notin B$,
so $x \notin A \cup B$ and consequently $x \in (A \cup B)^c$. □
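The set operations and De Morgan's laws can be tried out with Python's built-in set type. The universe and the sets A and B below are arbitrary small examples, and the helpers complement and power_set are hypothetical names introduced for this sketch.

```python
from itertools import chain, combinations

universe = set(range(10))          # a small universe, chosen for illustration
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(S):
    """Elements of the universe that are not in S."""
    return universe - S

def power_set(S):
    """All subsets of S, including the empty set and S itself."""
    items = list(S)
    subsets = chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

# De Morgan's laws: | is union, & is intersection, - is set difference.
de_morgan_1 = complement(A | B) == complement(A) & complement(B)
de_morgan_2 = complement(A & B) == complement(A) | complement(B)
print(de_morgan_1, de_morgan_2)
print(A - B, len(power_set({1, 2, 3})))
```

A check on one pair of sets is of course not a proof; it only illustrates the identities established above.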



Appendix B 

Linear Algebra 


This chapter provides a review of basic concepts in linear algebra. 


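B.1 Vector spaces

You are no doubt familiar with vectors in $\mathbb{R}^2$ or $\mathbb{R}^3$, e.g.

\[ x = \begin{bmatrix} 2.2 \\ 3 \end{bmatrix}, \qquad y = \begin{bmatrix} -1 \\ 0 \\ 5 \end{bmatrix}. \tag{B.1} \]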


From the point of view of algebra, vectors are much more general objects. They are elements of 
sets called vector spaces that satisfy the following definition. 


Definition B.1.1 (Vector space). A vector space consists of a set $V$ and two operations $+$ and
$\cdot$ satisfying the following conditions.

1. For any pair of elements $x, y \in V$ the vector sum $x + y$ belongs to $V$.

2. For any $x \in V$ and any scalar $\alpha \in \mathbb{R}$ the scalar multiple $\alpha \cdot x \in V$.

3. There exists a zero vector or origin $0$ such that $x + 0 = x$ for any $x \in V$.

4. For any $x \in V$ there exists an additive inverse $y$ such that $x + y = 0$, usually denoted as
$-x$.

5. The vector sum is commutative and associative, i.e. for all $x, y, z \in V$

\[ x + y = y + x, \qquad (x + y) + z = x + (y + z). \tag{B.2} \]

6. Scalar multiplication is associative, for any $\alpha, \beta \in \mathbb{R}$ and $x \in V$

\[ \alpha\, (\beta \cdot x) = (\alpha \beta) \cdot x. \tag{B.3} \]

7. Scalar and vector sums are both distributive, i.e. for all $\alpha, \beta \in \mathbb{R}$ and $x, y \in V$

\[ (\alpha + \beta) \cdot x = \alpha \cdot x + \beta \cdot x, \qquad \alpha \cdot (x + y) = \alpha \cdot x + \alpha \cdot y. \tag{B.4} \]

A subspace of a vector space $V$ is a subset of $V$ that is also itself a vector space.

From now on, for ease of notation we will ignore the symbol for the scalar product $\cdot$, writing
$\alpha \cdot x$ as $\alpha x$.

Remark B.1.2 (More general definition). We can define vector spaces over an arbitrary field, 
instead of R, such as the complex numbers C. We refer to any linear algebra text for more 
details. 


We can easily check that $\mathbb{R}^n$ is a valid vector space together with the usual vector addition and
vector-scalar product. In this case the zero vector is the all-zero vector $\begin{bmatrix} 0 & 0 & \cdots & 0 \end{bmatrix}^T$. When
thinking about vector spaces it is a good idea to have $\mathbb{R}^2$ or $\mathbb{R}^3$ in mind to gain intuition, but
it is also important to bear in mind that we can define vector spaces over many other objects,
such as infinite sequences, polynomials, functions and even random variables.


The definition of vector space guarantees that any linear combination of vectors in a vector 
space V, obtained by adding the vectors after multiplying by scalar coefficients, belongs to V. 
Given a set of vectors, a natural question to ask is whether they can be expressed as linear 
combinations of each other, i.e. if they are linearly dependent or independent. 


Definition B.1.3 (Linear dependence/independence). A set of $m$ vectors $x_1, x_2, \ldots, x_m$ is
linearly dependent if there exist $m$ scalar coefficients $\alpha_1, \alpha_2, \ldots, \alpha_m$ which are not all equal to zero
and such that

\[ \sum_{i=1}^{m} \alpha_i x_i = 0. \tag{B.5} \]

Otherwise, the vectors are linearly independent.

Equivalently, at least one vector in a linearly dependent set can be expressed as a linear
combination of the rest, whereas this is not the case for linearly independent sets.

Let us check the equivalence. Equation (B.5) holds with $\alpha_j \neq 0$ for some $j$ if and only if

\[ x_j = -\frac{1}{\alpha_j} \sum_{i \in \{1, \ldots, m\} / \{j\}} \alpha_i x_i. \tag{B.6} \]
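For vectors in $\mathbb{R}^n$, linear independence can be tested numerically by stacking the vectors as columns of a matrix and computing its rank. This is a sketch that relies on the numerical rank computed by NumPy, so it is subject to floating-point tolerances; the function name linearly_independent and the example vectors are hypothetical.

```python
import numpy as np

def linearly_independent(vectors):
    """True if the given vectors are linearly independent (numerical rank test)."""
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == len(vectors)

x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, -1.0])
x3 = 2 * x1 - 3 * x2               # a linear combination of x1 and x2

print(linearly_independent([x1, x2]))        # x1 and x2 are not multiples of each other
print(linearly_independent([x1, x2, x3]))    # x3 depends on x1 and x2
```

Here $2 x_1 - 3 x_2 - x_3 = 0$ is a combination of the form (B.5) with coefficients that are not all zero.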


We define the span of a set of vectors $\{x_1, \ldots, x_m\}$ as the set of all possible linear combinations
of the vectors:

\[ \operatorname{span}(x_1, \ldots, x_m) := \Big\{ y \;\Big|\; y = \sum_{i=1}^{m} \alpha_i x_i \text{ for some } \alpha_1, \alpha_2, \ldots, \alpha_m \in \mathbb{R} \Big\}. \tag{B.7} \]

This turns out to be a vector space.

Lemma B.1.4. The span of any set of vectors $x_1, \ldots, x_m$ belonging to a vector space $V$ is a
subspace of $V$.

Proof. The span is a subset of $V$ due to Conditions 1 and 2 in Definition B.1.1. We now show
that it is a vector space. Conditions 5, 6 and 7 in Definition B.1.1 hold because $V$ is a vector
space. We check Conditions 1, 2, 3 and 4 by proving that for two arbitrary elements of the span

\[ y_1 = \sum_{i=1}^{m} \alpha_i x_i, \qquad y_2 = \sum_{i=1}^{m} \beta_i x_i, \qquad \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m \in \mathbb{R}, \tag{B.8} \]

and any scalars $\gamma_1, \gamma_2 \in \mathbb{R}$, the vector $\gamma_1 y_1 + \gamma_2 y_2$ also belongs to the span. This holds because

\[ \gamma_1 y_1 + \gamma_2 y_2 = \sum_{i=1}^{m} (\gamma_1 \alpha_i + \gamma_2 \beta_i)\, x_i, \tag{B.9} \]

so $\gamma_1 y_1 + \gamma_2 y_2$ is in $\operatorname{span}(x_1, \ldots, x_m)$. Now to prove Condition 1 we set $\gamma_1 = \gamma_2 = 1$, for
Condition 2 $\gamma_2 = 0$, for Condition 3 $\gamma_1 = \gamma_2 = 0$ and for Condition 4 $\gamma_1 = -1$, $\gamma_2 = 0$. □

When working with a vector space, it is useful to consider the set of vectors with the smallest 
cardinality that spans the space. This is called a basis of the vector space. 

Definition B.1.5 (Basis). A basis of a vector space $V$ is a set of independent vectors $\{x_1, \ldots, x_m\}$
such that

\[ V = \operatorname{span}(x_1, \ldots, x_m). \tag{B.10} \]

An important property of all bases in a vector space is that they have the same cardinality.

Theorem B.1.6. If a vector space $V$ has a basis with finite cardinality then every basis of $V$
contains the same number of vectors.

This theorem, which is proved in Section B.8.1, allows us to define the dimension of a vector
space.

Definition B.1.7 (Dimension). The dimension $\dim(V)$ of a vector space $V$ is the cardinality
of any of its bases, or equivalently the smallest number of linearly independent vectors that span
$V$.

This definition coincides with the usual geometric notion of dimension in $\mathbb{R}^2$ and $\mathbb{R}^3$: a line
has dimension 1, whereas a plane has dimension 2 (as long as they contain the origin). Note
that there exist infinite-dimensional vector spaces, such as the continuous real-valued functions
defined on $[0, 1]$.

The vector space that we use to model a certain problem is usually called the ambient space
and its dimension the ambient dimension. In the case of $\mathbb{R}^n$ the ambient dimension is $n$.

Lemma B.1.8 (Dimension of $\mathbb{R}^n$). The dimension of $\mathbb{R}^n$ is $n$.

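Proof. Consider the set of vectors $e_1, \ldots, e_n \in \mathbb{R}^n$ defined by

\[
e_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad
e_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \qquad \ldots, \qquad
e_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.
\tag{B.11}
\]

One can easily check that this set is a basis. It is in fact the standard basis of $\mathbb{R}^n$. □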

B.2 Inner product and norm 

Up to now, the only operations we have considered are addition and multiplication by a scalar. 
In this section, we introduce a third operation, the inner product between two vectors. 

Definition B.2.1 (Inner product). An inner product on a vector space $V$ is an operation $\langle \cdot, \cdot \rangle$
that maps pairs of vectors to $\mathbb{R}$ and satisfies the following conditions.

• It is symmetric, for any $x, y \in V$

\[ \langle x, y \rangle = \langle y, x \rangle. \tag{B.12} \]

• It is linear, i.e. for any $\alpha \in \mathbb{R}$ and any $x, y, z \in V$

\begin{align}
\langle \alpha x, y \rangle &= \alpha \langle x, y \rangle, \tag{B.13} \\
\langle x + y, z \rangle &= \langle x, z \rangle + \langle y, z \rangle. \tag{B.14}
\end{align}

• It is positive semidefinite: $\langle x, x \rangle$ is nonnegative for all $x \in V$ and if $\langle x, x \rangle = 0$ then $x = 0$.

A vector space endowed with an inner product is called an inner-product space. An important
instance of an inner product is the dot product between two vectors $x, y \in \mathbb{R}^n$, defined as

\[ x \cdot y := \sum_{i} x[i]\, y[i], \tag{B.15} \]

where $x[i]$ is the $i$th entry of $x$. In this section we use $x_i$ to denote a vector, but in some other
parts of the notes it may also denote an entry of a vector $x$; this will be clear from the context.

It is easy to check that the dot product is a valid inner product. $\mathbb{R}^n$ endowed with the dot
product is usually called a Euclidean space of dimension $n$.

The norm of a vector is a generalization of the concept of length. 

Definition B.2.2 (Norm). Let $V$ be a vector space, a norm is a function $\|\cdot\|$ from $V$ to $\mathbb{R}$ that
satisfies the following conditions.

• It is homogeneous. For all $\alpha \in \mathbb{R}$ and $x \in V$

\[ \| \alpha x \| = |\alpha| \, \| x \|. \tag{B.16} \]

• It satisfies the triangle inequality

\[ \| x + y \| \leq \| x \| + \| y \|. \tag{B.17} \]

In particular, it is nonnegative (set $y = -x$).

• $\| x \| = 0$ implies that $x$ is the zero vector $0$.

A vector space equipped with a norm is called a normed space. Distances in a normed space
can be measured using the norm of the difference between vectors.

Definition B.2.3 (Distance). The distance between two vectors $x$ and $y$ in a normed space with
norm $\|\cdot\|$ is

\[ d(x, y) := \| x - y \|. \tag{B.18} \]

Inner-product spaces are normed spaces because we can define a valid norm using the inner
product. The norm induced by an inner product is obtained by taking the square root of the
inner product of the vector with itself,

\[ \| x \|_{\langle \cdot, \cdot \rangle} := \sqrt{\langle x, x \rangle}. \tag{B.19} \]

The norm induced by an inner product is clearly homogeneous by linearity and symmetry of
the inner product. $\| x \|_{\langle \cdot, \cdot \rangle} = 0$ implies $x = 0$ because the inner product is positive semidefinite.
We only need to establish that the triangle inequality holds to ensure that the induced norm
is a valid norm. This follows from a classic inequality in linear algebra, which is proved in
Section B.8.2.

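Theorem B.2.4 (Cauchy-Schwarz inequality). For any two vectors $x$ and $y$ in an inner-product
space

\[ | \langle x, y \rangle | \leq \| x \|_{\langle \cdot, \cdot \rangle} \, \| y \|_{\langle \cdot, \cdot \rangle}. \tag{B.20} \]

Assume $\| x \|_{\langle \cdot, \cdot \rangle} \neq 0$, then

\begin{align}
\langle x, y \rangle = - \| x \|_{\langle \cdot, \cdot \rangle} \, \| y \|_{\langle \cdot, \cdot \rangle}
&\iff y = - \frac{\| y \|_{\langle \cdot, \cdot \rangle}}{\| x \|_{\langle \cdot, \cdot \rangle}} \, x, \tag{B.21} \\
\langle x, y \rangle = \| x \|_{\langle \cdot, \cdot \rangle} \, \| y \|_{\langle \cdot, \cdot \rangle}
&\iff y = \frac{\| y \|_{\langle \cdot, \cdot \rangle}}{\| x \|_{\langle \cdot, \cdot \rangle}} \, x. \tag{B.22}
\end{align}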

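Corollary B.2.5. The norm induced by an inner product satisfies the triangle inequality.

Proof.

\begin{align}
\| x + y \|_{\langle \cdot, \cdot \rangle}^2
&= \| x \|_{\langle \cdot, \cdot \rangle}^2 + \| y \|_{\langle \cdot, \cdot \rangle}^2 + 2 \langle x, y \rangle \tag{B.23} \\
&\leq \| x \|_{\langle \cdot, \cdot \rangle}^2 + \| y \|_{\langle \cdot, \cdot \rangle}^2 + 2 \, \| x \|_{\langle \cdot, \cdot \rangle} \, \| y \|_{\langle \cdot, \cdot \rangle} \quad \text{by the Cauchy-Schwarz inequality} \notag \\
&= \left( \| x \|_{\langle \cdot, \cdot \rangle} + \| y \|_{\langle \cdot, \cdot \rangle} \right)^2. \tag{B.24}
\end{align}

□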
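The two inequalities can be illustrated numerically for the dot product in $\mathbb{R}^5$. The vectors below are random draws (the dimension and the seed are arbitrary); the last lines construct the aligned vector of (B.22), for which the Cauchy-Schwarz inequality holds with equality.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

inner = x @ y
norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)

cauchy_schwarz_holds = abs(inner) <= norm_x * norm_y
triangle_holds = np.linalg.norm(x + y) <= norm_x + norm_y

# Equality case: y_aligned is a positive multiple of x with the same norm as y.
y_aligned = (norm_y / norm_x) * x
equality_gap = abs(x @ y_aligned - norm_x * np.linalg.norm(y_aligned))
print(cauchy_schwarz_holds, triangle_holds, equality_gap < 1e-12)
```

A numerical check on one pair of vectors only illustrates the statements; the proofs are the ones given in the text.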


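The Euclidean or $\ell_2$ norm is the norm induced by the dot product in $\mathbb{R}^n$,

\[ \| x \|_2 := \sqrt{x \cdot x} = \sqrt{\sum_{i=1}^{n} x[i]^2}. \tag{B.25} \]

In the case of $\mathbb{R}^2$ or $\mathbb{R}^3$ it is what we usually think of as the length of the vector.

B.3 Orthogonality

An important concept in linear algebra is orthogonality.

Definition B.3.1 (Orthogonality). Two vectors $x$ and $y$ are orthogonal if

\[ \langle x, y \rangle = 0. \tag{B.26} \]

A vector $x$ is orthogonal to a set $S$, if

\[ \langle x, s \rangle = 0, \quad \text{for all } s \in S. \tag{B.27} \]

Two sets $S_1$, $S_2$ are orthogonal if for any $x \in S_1$, $y \in S_2$

\[ \langle x, y \rangle = 0. \tag{B.28} \]

The orthogonal complement of a subspace $S$ is

\[ S^{\perp} := \{ x \mid \langle x, y \rangle = 0 \text{ for all } y \in S \}. \tag{B.29} \]

Distances between orthogonal vectors measured in terms of the norm induced by the inner
product are easy to compute.

Theorem B.3.2 (Pythagorean theorem). If $x$ and $y$ are orthogonal vectors

\[ \| x + y \|_{\langle \cdot, \cdot \rangle}^2 = \| x \|_{\langle \cdot, \cdot \rangle}^2 + \| y \|_{\langle \cdot, \cdot \rangle}^2. \tag{B.30} \]

Proof. By linearity of the inner product

\begin{align}
\| x + y \|_{\langle \cdot, \cdot \rangle}^2 &= \| x \|_{\langle \cdot, \cdot \rangle}^2 + \| y \|_{\langle \cdot, \cdot \rangle}^2 + 2 \langle x, y \rangle \tag{B.31} \\
&= \| x \|_{\langle \cdot, \cdot \rangle}^2 + \| y \|_{\langle \cdot, \cdot \rangle}^2. \tag{B.32}
\end{align}

□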

If we want to show that a vector is orthogonal to a certain subspace, it is enough to show that
it is orthogonal to every vector in a basis of the subspace.

Lemma B.3.3. Let $x$ be a vector and $S$ a subspace of dimension $n$. If for a basis $b_1, b_2, \ldots, b_n$
of $S$,

\[ \langle x, b_i \rangle = 0, \quad 1 \leq i \leq n, \tag{B.33} \]

then $x$ is orthogonal to $S$.

Proof. Any vector $v \in S$ can be represented as $v = \sum_{i=1}^{n} \alpha_i b_i$ for $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, so from (B.33)

\[ \langle x, v \rangle = \Big\langle x, \sum_{i=1}^{n} \alpha_i b_i \Big\rangle = \sum_{i=1}^{n} \alpha_i \langle x, b_i \rangle = 0. \tag{B.34} \]

□

We now introduce orthonormal bases.

Definition B.3.4 (Orthonormal basis). A basis of mutually orthogonal vectors with norm equal
to one is called an orthonormal basis.

It is very easy to find the coefficients of a vector in an orthonormal basis: we just need to
compute the inner products with the basis vectors.

Lemma B.3.5 (Coefficients in an orthonormal basis). If $\{u_1, \ldots, u_n\}$ is an orthonormal basis
of a vector space $V$, for any vector $x \in V$

\[ x = \sum_{i=1}^{n} \langle u_i, x \rangle \, u_i. \tag{B.35} \]

Proof. Since $\{u_1, \ldots, u_n\}$ is a basis,

\[ x = \sum_{i=1}^{n} \alpha_i u_i \quad \text{for some } \alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}. \tag{B.36} \]

Immediately,

\[ \langle u_i, x \rangle = \Big\langle u_i, \sum_{j=1}^{n} \alpha_j u_j \Big\rangle = \sum_{j=1}^{n} \alpha_j \langle u_i, u_j \rangle = \alpha_i \tag{B.37} \]

because $\langle u_i, u_i \rangle = 1$ and $\langle u_i, u_j \rangle = 0$ for $i \neq j$.

□


For any subspace of $\mathbb{R}^n$ we can obtain an orthonormal basis by applying the Gram-Schmidt
method to a set of linearly independent vectors spanning the subspace.

Algorithm B.3.6 (Gram-Schmidt). Consider a set of linearly independent vectors $x_1, \ldots, x_m$
in $\mathbb{R}^n$. To obtain an orthonormal basis of the span of these vectors we:

1. Set $u_1 := x_1 / \| x_1 \|_2$.

2. For $i = 2, \ldots, m$, compute

\[ v_i := x_i - \sum_{j=1}^{i-1} \langle u_j, x_i \rangle \, u_j \tag{B.38} \]

and set $u_i := v_i / \| v_i \|_2$.

It is not difficult to show that the resulting set of vectors $u_1, \ldots, u_m$ is an orthonormal basis for
the span of $x_1, \ldots, x_m$. This implies in particular that we can always assume that a subspace
has an orthonormal basis.
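The two steps of the algorithm translate almost directly into code. The following is a minimal sketch of the classical Gram-Schmidt procedure for vectors in $\mathbb{R}^n$ with the dot product; the function name gram_schmidt and the example vectors are illustrative choices, and the input is assumed to be linearly independent.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors with classical Gram-Schmidt.

    Returns an array whose rows are the orthonormal vectors u_1, ..., u_m.
    """
    basis = []
    for x in vectors:
        # Subtract the components of x along the vectors already in the basis.
        v = x - sum((u @ x) * u for u in basis)
        basis.append(v / np.linalg.norm(v))
    return np.array(basis)

xs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
U = gram_schmidt(xs)
print(np.round(U @ U.T, 10))   # Gram matrix of the output vectors
```

For the first vector the sum is empty, so the loop reduces to normalizing $x_1$. In floating-point arithmetic classical Gram-Schmidt can lose orthogonality when the inputs are nearly dependent; library routines based on the QR factorization (for example np.linalg.qr) are the usual practical choice.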

Theorem B.3.7. Every finite-dimensional vector space has an orthonormal basis.

Proof. To see that the Gram-Schmidt method produces an orthonormal basis for the span of
the input vectors we can check that $\operatorname{span}(x_1, \ldots, x_i) = \operatorname{span}(u_1, \ldots, u_i)$ for each $i$ and that $u_1, \ldots, u_i$ is a
set of orthonormal vectors. □

B.4 Projections

The projection of a vector $x$ onto a subspace $S$ is the vector in $S$ that is closest to $x$. In order
to define this rigorously, we start by introducing the concept of direct sum. If two subspaces are
disjoint, i.e. their only common point is the origin, then a vector that can be written as a sum
of a vector from each subspace is said to belong to their direct sum.

Definition B.4.1 (Direct sum). Let $V$ be a vector space. For any subspaces $S_1, S_2 \subseteq V$ such
that

\[ S_1 \cap S_2 = \{0\} \tag{B.39} \]

the direct sum is defined as

\[ S_1 \oplus S_2 := \{ x \mid x = s_1 + s_2, \; s_1 \in S_1, \; s_2 \in S_2 \}. \tag{B.40} \]

The representation of a vector in the direct sum of two subspaces is unique.

Lemma B.4.2. Any vector $x \in S_1 \oplus S_2$ has a unique representation

\[ x = s_1 + s_2, \quad s_1 \in S_1, \; s_2 \in S_2. \tag{B.41} \]

Proof. If $x \in S_1 \oplus S_2$ then by definition there exist $s_1 \in S_1$, $s_2 \in S_2$ such that $x = s_1 + s_2$.
Assume $x = v_1 + v_2$, $v_1 \in S_1$, $v_2 \in S_2$, then $s_1 - v_1 = v_2 - s_2$. This implies that $s_1 - v_1$ and
$s_2 - v_2$ are in $S_1$ and also in $S_2$. However, $S_1 \cap S_2 = \{0\}$, so we conclude $s_1 = v_1$ and $s_2 = v_2$. □


We can now define the projection of a vector $x$ onto a subspace $S$ by separating the vector into
a component that belongs to $S$ and another that belongs to its orthogonal complement.

Definition B.4.3 (Orthogonal projection). Let $V$ be a vector space. The orthogonal projection
of a vector $x \in V$ onto a subspace $S \subseteq V$ is a vector in $S$ denoted by $\mathcal{P}_S\, x$ such that $x - \mathcal{P}_S\, x \in S^{\perp}$.

Theorem B.4.4 (Properties of orthogonal projections). Let $V$ be a vector space. Every vector
$x \in V$ has a unique orthogonal projection $\mathcal{P}_S\, x$ onto any subspace $S \subseteq V$ of finite dimension.
In particular $x$ can be expressed as

\[ x = \mathcal{P}_S\, x + \mathcal{P}_{S^{\perp}}\, x. \tag{B.42} \]

For any vector $s \in S$

\[ \langle x, s \rangle = \langle \mathcal{P}_S\, x, s \rangle. \tag{B.43} \]

For any orthonormal basis $b_1, \ldots, b_m$ of $S$,

\[ \mathcal{P}_S\, x = \sum_{i=1}^{m} \langle x, b_i \rangle \, b_i. \tag{B.44} \]

Proof. Let us denote the dimension of $S$ by $m$. Since $m$ is finite, there exists an orthonormal basis
of $S$: $b'_1, \ldots, b'_m$. Consider the vector

\[ p := \sum_{i=1}^{m} \langle x, b'_i \rangle \, b'_i. \tag{B.45} \]

It turns out that $x - p$ is orthogonal to every vector in the basis. For $1 \leq j \leq m$,

\begin{align}
\langle x - p, b'_j \rangle &= \Big\langle x - \sum_{i=1}^{m} \langle x, b'_i \rangle \, b'_i, \; b'_j \Big\rangle \tag{B.46} \\
&= \langle x, b'_j \rangle - \sum_{i=1}^{m} \langle x, b'_i \rangle \langle b'_i, b'_j \rangle \tag{B.47} \\
&= \langle x, b'_j \rangle - \langle x, b'_j \rangle = 0, \tag{B.48}
\end{align}

so $x - p \in S^{\perp}$ and $p$ is an orthogonal projection. Since $S \cap S^{\perp} = \{0\}$¹ there cannot be two
other vectors $x_1 \in S$, $x_2 \in S^{\perp}$ such that $x = x_1 + x_2$, so the orthogonal projection is unique.

Notice that $o := x - p$ is a vector in $S^{\perp}$ such that $x - o = p$ is in $S$ and therefore in $(S^{\perp})^{\perp}$.
This implies that $o$ is the orthogonal projection of $x$ onto $S^{\perp}$ and establishes (B.42).

Equation (B.43) follows immediately from the orthogonality of any vector $s \in S$ and $\mathcal{P}_{S^{\perp}}\, x$.
Equation (B.44) follows from (B.43). □


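Computing the norm of the projection of a vector onto a subspace is easy if we have access to
an orthonormal basis (as long as the norm is induced by the inner product).

Lemma B.4.5 (Norm of the projection). The norm of the projection of an arbitrary vector
$x \in V$ onto a subspace $S \subseteq V$ of dimension $d$ can be written as

\[ \| \mathcal{P}_S\, x \|_{\langle \cdot, \cdot \rangle} = \sqrt{\sum_{i=1}^{d} \langle b_i, x \rangle^2} \tag{B.49} \]

for any orthonormal basis $b_1, \ldots, b_d$ of $S$.

Proof. By (B.44)

\begin{align}
\| \mathcal{P}_S\, x \|_{\langle \cdot, \cdot \rangle}^2 &= \langle \mathcal{P}_S\, x, \mathcal{P}_S\, x \rangle \tag{B.50} \\
&= \Big\langle \sum_{i=1}^{d} \langle b_i, x \rangle \, b_i, \; \sum_{j=1}^{d} \langle b_j, x \rangle \, b_j \Big\rangle \tag{B.51} \\
&= \sum_{i=1}^{d} \sum_{j=1}^{d} \langle b_i, x \rangle \langle b_j, x \rangle \langle b_i, b_j \rangle \tag{B.52} \\
&= \sum_{i=1}^{d} \langle b_i, x \rangle^2. \tag{B.53}
\end{align}

□

¹ For any vector $v$ that belongs to both $S$ and $S^{\perp}$, $\langle v, v \rangle = \| v \|_{\langle \cdot, \cdot \rangle}^2 = 0$, which implies $v = 0$.

Example B.4.6 (Projection onto a one-dimensional subspace). To compute the projection
of a vector $x$ onto a one-dimensional subspace spanned by a vector $v$, we use the fact that
$\{ v / \| v \|_{\langle \cdot, \cdot \rangle} \}$ is a basis for $\operatorname{span}(v)$ (it is a set containing a unit vector that spans the subspace)
and apply (B.44) to obtain

\[ \mathcal{P}_{\operatorname{span}(v)}\, x = \frac{\langle v, x \rangle}{\| v \|_{\langle \cdot, \cdot \rangle}^2} \, v. \tag{B.54} \]

△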


Finally, we prove that the projection of a vector x onto a subspace S is indeed the vector in S 
that is closest to x in the distance induced by the inner-product norm. 

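Theorem B.4.7 (The orthogonal projection is closest). The orthogonal projection of a vector $x$ onto a subspace $S$ belonging to the same inner-product space is the closest vector to $x$ that belongs to $S$ in terms of the norm induced by the inner product. More formally, $\mathcal{P}_S \, x$ is the solution to the optimization problem

\begin{align}
\underset{u}{\text{minimize}} \quad & \|x - u\|_{\langle\cdot,\cdot\rangle} \tag{B.55} \\
\text{subject to} \quad & u \in S. \tag{B.56}
\end{align}

Proof. Take any point $s \in S$ such that $s \neq \mathcal{P}_S \, x$. Then

\begin{align}
\|x - s\|_{\langle\cdot,\cdot\rangle}^2 &= \|x - \mathcal{P}_S \, x + \mathcal{P}_S \, x - s\|_{\langle\cdot,\cdot\rangle}^2 \tag{B.57} \\
&= \|x - \mathcal{P}_S \, x\|_{\langle\cdot,\cdot\rangle}^2 + \|\mathcal{P}_S \, x - s\|_{\langle\cdot,\cdot\rangle}^2 \tag{B.58} \\
&> \|x - \mathcal{P}_S \, x\|_{\langle\cdot,\cdot\rangle}^2 \quad \text{because } s \neq \mathcal{P}_S \, x, \tag{B.59}
\end{align}

where (B.58) follows from the Pythagorean theorem because $\mathcal{P}_{S^\perp} \, x := x - \mathcal{P}_S \, x$ belongs to $S^\perp$ and $\mathcal{P}_S \, x - s$ to $S$. □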
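The statements of Lemma B.4.5 and Theorem B.4.7 can be checked numerically. The sketch below assumes the standard dot product in $\mathbb{R}^3$ and a hand-picked orthonormal basis of a plane; the helper names and all numerical values are illustrative assumptions, and the random comparison is only a sanity check, not a proof.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def project(basis, x):
    """P_S x = sum_i <b_i, x> b_i for an orthonormal basis of S, as in (B.44)."""
    p = [0.0] * len(x)
    for b in basis:
        c = dot(b, x)
        p = [pi + c * bi for pi, bi in zip(p, b)]
    return p


# Orthonormal basis of a plane in R^3 (illustrative choice).
s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]
x = [1.0, 3.0, 2.0]

p = project(basis, x)

# Lemma B.4.5: ||P_S x|| equals the Euclidean norm of the coefficients <b_i, x>.
norm_from_coeffs = math.sqrt(sum(dot(b, x) ** 2 for b in basis))

# Theorem B.4.7: no other point of S should be closer to x than P_S x.
best = norm([xi - pi for xi, pi in zip(x, p)])
rng = random.Random(0)
others = []
for _ in range(1000):
    a, c = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
    point = [a * b0 + c * b1 for b0, b1 in zip(basis[0], basis[1])]
    others.append(norm([xi - si for xi, si in zip(x, point)]))

print(norm(p), norm_from_coeffs)
print(best, min(others))
```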


B.5 Matrices 

A matrix is a rectangular array of numbers. We denote the vector space of $m \times n$ matrices by $\mathbb{R}^{m \times n}$. We denote the $i$th row of a matrix $A$ by $A_{i:}$, the $j$th column by $A_{:j}$ and the $(i,j)$ entry by $A_{ij}$. The transpose of a matrix is obtained by switching its rows and columns.

Definition B.5.1 (Transpose). The transpose $A^T$ of a matrix $A \in \mathbb{R}^{m \times n}$ is a matrix $A^T \in \mathbb{R}^{n \times m}$ with entries

\[
(A^T)_{ij} = A_{ji}. \tag{B.60}
\]

A symmetric matrix is a matrix that is equal to its transpose. 

Matrices map vectors to other vectors through a linear operation called matrix-vector product.

Definition B.5.2 (Matrix-vector product). The product of a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$ is a vector $Ax \in \mathbb{R}^m$, such that

\begin{align}
(Ax)_i &= \sum_{j=1}^{n} A_{ij} \, x[j] \tag{B.61} \\
&= \langle A_{i:}, x \rangle, \tag{B.62}
\end{align}

i.e. the $i$th entry of $Ax$ is the dot product between the $i$th row of $A$ and $x$.

Equivalently,

\[
Ax = \sum_{j=1}^{n} A_{:j} \, x[j], \tag{B.63}
\]

i.e. Ax is a linear combination of the columns of A weighted by the entries in x. 

One can easily check that the transpose of the product of two matrices A and B is equal to the 
transposes multiplied in the inverse order, 

\[
(AB)^T = B^T A^T. \tag{B.64}
\]

We can express the dot product between two vectors x and y as 

\[
\langle x, y \rangle = x^T y = y^T x. \tag{B.65}
\]


The identity matrix is a matrix that maps any vector to itself. 
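Definition B.5.3 (Identity matrix). The identity matrix in $\mathbb{R}^{n \times n}$ is

\[
I := \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}. \tag{B.66}
\]

Clearly, for any $x \in \mathbb{R}^n$, $Ix = x$.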

Definition B.5.4 (Matrix multiplication). The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is a matrix $AB \in \mathbb{R}^{m \times p}$, such that

\[
(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} = \langle A_{i:}, B_{:j} \rangle, \tag{B.67}
\]

i.e. the $(i,j)$ entry of $AB$ is the dot product between the $i$th row of $A$ and the $j$th column of $B$. Equivalently, the $j$th column of $AB$ is the result of multiplying $A$ and the $j$th column of $B$,

\[
(AB)_{:j} = A B_{:j}, \tag{B.68}
\]

and the $i$th row of $AB$ is the result of multiplying the $i$th row of $A$ and $B$, $(AB)_{i:} = A_{i:} B$.
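The different views of the matrix-vector and matrix-matrix products in (B.61) to (B.68) can be compared on a small example. The following is a minimal sketch using plain Python lists; the function names and the matrices are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Row view (B.62): entry i of Ax is the dot product of row i of A with x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def matvec_columns(A, x):
    """Column view (B.63): Ax is a combination of the columns of A weighted by x."""
    out = [0.0] * len(A)
    for j in range(len(x)):
        for i in range(len(A)):
            out[i] += A[i][j] * x[j]
    return out


def matmul(A, B):
    """Entry view (B.67): (AB)_ij is the dot product of row i of A and column j of B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]    # 2 x 3 (illustrative values)
B = [[1.0, 0.0], [2.0, 1.0], [0.0, 4.0]]  # 3 x 2 (illustrative values)
x = [1.0, -1.0, 2.0]

AB = matmul(A, B)
first_column = matvec(A, [row[0] for row in B])  # A times the first column of B

print(matvec(A, x), matvec_columns(A, x))
print(AB, first_column)
print(transpose(AB) == matmul(transpose(B), transpose(A)))  # checks (B.64)
```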

Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the effect of the matrix on any vector.

Definition B.5.5 (Matrix inverse). The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is a matrix $A^{-1} \in \mathbb{R}^{n \times n}$ such that

\[
A A^{-1} = A^{-1} A = I. \tag{B.69}
\]

Lemma B.5.6. The inverse of a matrix is unique. 

Proof. Let us assume there is another matrix $M$ such that $AM = I$, then

\begin{align}
M &= A^{-1} A M \quad \text{by (B.69)} \tag{B.70} \\
&= A^{-1}. \tag{B.71}
\end{align}

□ 


An important class of matrices are orthogonal matrices. 

Definition B.5.7 (Orthogonal matrix). An orthogonal matrix is a square matrix such that its 
inverse is equal to its transpose, 


\[
U^T U = U U^T = I. \tag{B.72}
\]

By definition, the columns $U_{:1}, U_{:2}, \ldots, U_{:n}$ of any orthogonal matrix have unit norm and are orthogonal to each other, so they form an orthonormal basis (it's somewhat confusing that orthogonal matrices are not called orthonormal matrices instead). We can interpret applying $U^T$ to a vector $x$ as computing the coefficients of its representation in the basis formed by the columns of $U$. Applying $U$ to $U^T x$ recovers $x$ by scaling each basis vector with the corresponding coefficient:

\[
x = U U^T x = \sum_{i=1}^{n} \langle U_{:i}, x \rangle \, U_{:i}. \tag{B.73}
\]


Applying an orthogonal matrix to a vector does not affect its norm; it only rotates or reflects the vector.


Lemma B.5.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix $U \in \mathbb{R}^{n \times n}$ and any vector $x \in \mathbb{R}^n$,

\[
\|Ux\|_2 = \|x\|_2. \tag{B.74}
\]

Proof. By the definition of an orthogonal matrix

\begin{align}
\|Ux\|_2^2 &= x^T U^T U x \tag{B.75} \\
&= x^T x \tag{B.76} \\
&= \|x\|_2^2. \tag{B.77}
\end{align}

□
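A two-dimensional rotation matrix is a simple instance of an orthogonal matrix. The sketch below, with an arbitrarily chosen angle and vector (both assumptions for illustration), checks (B.73) and (B.74) numerically.

```python
import math


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


theta = 0.7  # rotation angle in radians (illustrative value)
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
x = [3.0, -4.0]

coeffs = matvec(transpose(U), x)  # coefficients of x in the basis of columns of U
recovered = matvec(U, coeffs)     # U U^T x, which should equal x by (B.73)
rotated = matvec(U, x)

print(norm(x), norm(rotated))     # both norms agree up to rounding, as in (B.74)
print(recovered)
```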

B.6 Eigendecomposition

An eigenvector $v$ of a matrix $A$ satisfies

\[
Av = \lambda v \tag{B.78}
\]

for a scalar $\lambda$ which is the corresponding eigenvalue. Even if $A$ is real, its eigenvectors and eigenvalues can be complex.

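Lemma B.6.1 (Eigendecomposition). If a square matrix $A \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors $v_1, \ldots, v_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$ it can be expressed in terms of a matrix $Q$, whose columns are the eigenvectors, and a diagonal matrix containing the eigenvalues,

\begin{align}
A &= \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}
\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}
\begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}^{-1} \tag{B.79} \\
&= Q \Lambda Q^{-1}. \tag{B.80}
\end{align}

Proof.

\begin{align}
AQ &= \begin{bmatrix} A v_1 & A v_2 & \cdots & A v_n \end{bmatrix} \tag{B.81} \\
&= \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix} \tag{B.82} \\
&= Q \Lambda. \tag{B.83}
\end{align}

If the columns of a square matrix are all linearly independent, then the matrix has an inverse, so multiplying both sides of the expression by $Q^{-1}$ from the right completes the proof. □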

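Lemma B.6.2. Not all matrices have an eigendecomposition.

Proof. Consider for example the matrix

\[
A := \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}. \tag{B.84}
\]

Assume $A$ has a nonzero eigenvalue $\lambda$ corresponding to an eigenvector with entries $v[1]$ and $v[2]$, then

\[
\begin{bmatrix} v[2] \\ 0 \end{bmatrix}
= \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v[1] \\ v[2] \end{bmatrix}
= \begin{bmatrix} \lambda v[1] \\ \lambda v[2] \end{bmatrix}, \tag{B.85}
\]

which implies that $v[2] = 0$ and hence $v[1] = 0$, since we have assumed that $\lambda \neq 0$. This implies that the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors. For the remaining eigenvalue $\lambda = 0$, the equation $Av = 0$ forces $v[2] = 0$, so every eigenvector is a multiple of the vector with entries $1$ and $0$. The matrix therefore does not have two linearly independent eigenvectors, so it has no eigendecomposition. □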

An interesting use of the eigendecomposition is computing successive matrix products very fast. Assume that we want to compute

\[
A A \cdots A x = A^k x, \tag{B.86}
\]

i.e. we want to apply $A$ to $x$ $k$ times. $A^k$ cannot be computed by taking the power of its entries (try out a simple example to convince yourself). However, if $A$ has an eigendecomposition,

\begin{align}
A^k &= Q \Lambda Q^{-1} Q \Lambda Q^{-1} \cdots Q \Lambda Q^{-1} \tag{B.87} \\
&= Q \Lambda^k Q^{-1} \tag{B.88} \\
&= Q \begin{bmatrix} \lambda_1^k & 0 & \cdots & 0 \\ 0 & \lambda_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^k \end{bmatrix} Q^{-1}, \tag{B.89}
\end{align}

using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to taking the power of the diagonal entries. This allows us to compute the $k$ matrix products using just 3 matrix products and taking the power of $n$ numbers.
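To make (B.87) to (B.89) concrete, the sketch below compares $Q \Lambda^k Q^{-1}$ with repeated multiplication for a small matrix. The matrix and its eigendecomposition (eigenvalues 3 and 1, eigenvectors with entries $(1, 1)$ and $(1, -1)$) were worked out by hand for this illustration and are not taken from the text.

```python
def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


A = [[2.0, 1.0], [1.0, 2.0]]
Q = [[1.0, 1.0], [1.0, -1.0]]      # columns are the eigenvectors (1, 1) and (1, -1)
Q_inv = [[0.5, 0.5], [0.5, -0.5]]  # inverse of Q
eigenvalues = [3.0, 1.0]
k = 5

# Lambda^k: raise each diagonal entry to the k-th power.
Lambda_k = [[eigenvalues[0] ** k, 0.0], [0.0, eigenvalues[1] ** k]]
A_k_eig = matmul(matmul(Q, Lambda_k), Q_inv)

# Reference: multiply A by itself k - 1 times.
A_k_direct = A
for _ in range(k - 1):
    A_k_direct = matmul(A_k_direct, A)

# Taking the power of the entries does not give A^k.
entrywise = [[a ** k for a in row] for row in A]

print(A_k_eig)
print(A_k_direct)
print(entrywise)
```

The first two results agree, whereas the entrywise power differs, as the text suggests checking.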

From high-school or undergraduate algebra you probably remember how to compute eigenvectors using determinants. In practice, this is usually not a viable option due to stability issues. A popular technique to compute eigenvectors is based on the following insight. Let $A \in \mathbb{R}^{n \times n}$ be a matrix with eigendecomposition $Q \Lambda Q^{-1}$ and let $x$ be an arbitrary vector in $\mathbb{R}^n$. Since the columns of $Q$ are linearly independent, they form a basis for $\mathbb{R}^n$, so we can represent $x$ as

\[
x = \sum_{i=1}^{n} \alpha_i Q_{:i}, \qquad \alpha_i \in \mathbb{R}, \; 1 \leq i \leq n. \tag{B.90}
\]

Now let us apply $A$ to $x$ $k$ times,

\begin{align}
A^k x &= \sum_{i=1}^{n} \alpha_i A^k Q_{:i} \tag{B.91} \\
&= \sum_{i=1}^{n} \alpha_i \lambda_i^k Q_{:i}. \tag{B.92}
\end{align}

If we assume that the eigenvalues are ordered according to their magnitudes and that the magnitude of one of them is larger than the rest, $|\lambda_1| > |\lambda_2| \geq \ldots$, and that $\alpha_1 \neq 0$ (which happens with high probability if we draw a random $x$) then as $k$ grows larger the term $\alpha_1 \lambda_1^k Q_{:1}$ dominates. The term will blow up or tend to zero unless we normalize every time before applying $A$. Adding the normalization step to this procedure results in the power method or power iteration, an algorithm of great importance in numerical linear algebra.

Algorithm B.6.3 (Power method).

Input: A matrix $A$.

Output: An estimate of the eigenvector of $A$ corresponding to the eigenvalue with the largest magnitude.

Initialization: Set $x_1 := x / \|x\|_2$, where the entries of $x$ are drawn at random.

For $i = 2, \ldots, k$, compute

\[
x_i := \frac{A x_{i-1}}{\|A x_{i-1}\|_2}. \tag{B.93}
\]

Figure B.1: Illustration of the first three iterations of the power method for a matrix with eigenvectors $v_1$ and $v_2$, whose corresponding eigenvalues are $\lambda_1 = 1.05$ and $\lambda_2 = 0.1661$.


Figure B.1 illustrates the power method on a simple example, where the matrix is equal to

\[
A := \begin{bmatrix} 0.930 & 0.388 \\ 0.237 & 0.286 \end{bmatrix}. \tag{B.94}
\]


The convergence to the eigenvector corresponding to the eigenvalue with the largest magnitude 
is very fast. 
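The power method can be sketched in a few lines of Python. The version below is a minimal illustration applied to the matrix in (B.94); the function name `power_method`, the Gaussian initialization, the seed and the number of iterations are assumptions made for this sketch, and the eigenvalue estimate $x^T A x$ is an extra diagnostic that is not part of Algorithm B.6.3.

```python
import math
import random


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def power_method(A, k, seed=0):
    """Run k normalized multiplications by A starting from a random unit vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(len(A))]
    scale = norm(x)
    x = [xi / scale for xi in x]
    for _ in range(k):
        y = matvec(A, x)
        scale = norm(y)
        x = [yi / scale for yi in y]  # normalize before the next multiplication
    return x


A = [[0.930, 0.388], [0.237, 0.286]]  # matrix from (B.94)
x = power_method(A, k=50)

Ax = matvec(A, x)
eigenvalue_estimate = sum(a * b for a, b in zip(x, Ax))  # x^T A x with unit-norm x
residual = norm([a - eigenvalue_estimate * b for a, b in zip(Ax, x)])

print(x, eigenvalue_estimate, residual)
```

The estimate should be close to the larger eigenvalue quoted in the caption of Figure B.1, and the residual $\|Ax - \lambda x\|_2$ should be close to zero.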


B.7 Eigendecomposition of symmetric matrices 


Real symmetric matrices always have an eigendecomposition. In addition, their eigenvalues are 
real and their eigenvectors are all orthogonal. 

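Theorem B.7.1 (Spectral theorem for real symmetric matrices). If $A \in \mathbb{R}^{n \times n}$ is symmetric, then it has an eigendecomposition of the form

\[
A = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}
\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}
\begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}^T, \tag{B.95}
\]

where the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ are real and the eigenvectors $u_1, u_2, \ldots, u_n$ are real and orthogonal.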


Proof. The proof that every real symmetric matrix has $n$ eigenvectors is beyond the scope of these notes. Under the assumption that this is the case, we begin by proving that the eigenvalues are real. Consider an arbitrary eigenvalue $\lambda_i$ and the corresponding normalized eigenvector $v_i$. We have

\begin{align}
v_i^* A v_i &= \lambda_i v_i^* v_i = \lambda_i, \tag{B.96} \\
v_i^* A v_i &= (A v_i)^* v_i = (\lambda_i v_i)^* v_i = \overline{\lambda_i} \, v_i^* v_i = \overline{\lambda_i}. \tag{B.97}
\end{align}

This implies that $\lambda_i$ is real because $\lambda_i = \overline{\lambda_i}$, so we can restrict the eigenvectors to be real (since the eigenvalue is real, both the real and imaginary parts of the eigenvector are eigenvectors themselves and at least one of them must be nonzero). If several linearly independent eigenvectors have the same eigenvalue, an orthonormal basis of their span will also consist of eigenvectors of the matrix. All that is left to prove is that eigenvectors corresponding to different eigenvalues are orthogonal. Assume $u_i$ and $u_j$ are eigenvectors corresponding to different eigenvalues $\lambda_i \neq \lambda_j$, where we take $\lambda_i \neq 0$ without loss of generality (at most one of the two eigenvalues can be zero), then

\begin{align}
u_i^T u_j &= \frac{1}{\lambda_i} (A u_i)^T u_j \tag{B.98} \\
&= \frac{1}{\lambda_i} u_i^T A^T u_j \tag{B.99} \\
&= \frac{1}{\lambda_i} u_i^T A u_j \tag{B.100} \\
&= \frac{\lambda_j}{\lambda_i} u_i^T u_j. \tag{B.101}
\end{align}

Since $\lambda_i \neq \lambda_j$, this is only possible if $u_i^T u_j = 0$.

□
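As a sanity check of the form (B.95), the sketch below verifies the decomposition for one small symmetric matrix. The matrix and its eigenpairs (eigenvalues 4 and $-1$ with eigenvectors proportional to $(2, 1)$ and $(1, -2)$) were computed by hand for this illustration and are assumptions, not values from the text.

```python
import math


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


A = [[3.0, 2.0], [2.0, 0.0]]  # symmetric matrix (illustrative values)

c = 1.0 / math.sqrt(5.0)
U = [[2.0 * c, 1.0 * c],      # columns: (2, 1)/sqrt(5) and (1, -2)/sqrt(5)
     [1.0 * c, -2.0 * c]]
Lam = [[4.0, 0.0], [0.0, -1.0]]  # real eigenvalues on the diagonal

gram = matmul(transpose(U), U)                        # should be the identity
reconstructed = matmul(matmul(U, Lam), transpose(U))  # should equal A

print(gram)
print(reconstructed)
```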


The eigenvalues of a symmetric matrix determine the value of the quadratic form:

\[
q(x) := x^T A x = \sum_{i=1}^{n} \lambda_i \left( x^T u_i \right)^2. \tag{B.102}
\]

If we order the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ then the first eigenvalue is the maximum value attained by the quadratic form if its input has unit $\ell_2$ norm, the second eigenvalue is the maximum value attained by the quadratic form if we restrict its argument to be normalized and orthogonal to the first eigenvector, and so on.


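Theorem B.7.2. For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $u_1, u_2, \ldots, u_n$ with corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$

\begin{align}
\lambda_1 &= \max_{\|u\|_2 = 1} u^T A u, \tag{B.103} \\
u_1 &= \arg \max_{\|u\|_2 = 1} u^T A u, \tag{B.104} \\
\lambda_k &= \max_{\|u\|_2 = 1, \, u \perp u_1, \ldots, u_{k-1}} u^T A u, \tag{B.105} \\
u_k &= \arg \max_{\|u\|_2 = 1, \, u \perp u_1, \ldots, u_{k-1}} u^T A u. \tag{B.106}
\end{align}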


Proof. The eigenvectors are an orthonormal basis (they are mutually orthogonal and we assume that they have been normalized), so we can represent any unit-norm vector $h_k$ that is orthogonal to $u_1, \ldots, u_{k-1}$ as

\[
h_k = \sum_{i=k}^{n} \alpha_i u_i \tag{B.107}
\]

where

\[
\|h_k\|_2^2 = \sum_{i=k}^{n} \alpha_i^2 = 1, \tag{B.108}
\]

by Lemma B.4.5. Note that $h_1$ is just an arbitrary unit-norm vector.

Now we will show that the value of the quadratic form when the normalized input is restricted to be orthogonal to $u_1, \ldots, u_{k-1}$ cannot be larger than $\lambda_k$,

\begin{align}
h_k^T A h_k &= \sum_{i=1}^{n} \lambda_i \Big( \sum_{j=k}^{n} \alpha_j u_i^T u_j \Big)^2 \quad \text{by (B.102) and (B.107)} \tag{B.109} \\
&= \sum_{i=k}^{n} \lambda_i \alpha_i^2 \quad \text{because } u_1, \ldots, u_n \text{ is an orthonormal basis} \tag{B.110} \\
&\leq \lambda_k \sum_{i=k}^{n} \alpha_i^2 \quad \text{because } \lambda_k \geq \lambda_{k+1} \geq \ldots \geq \lambda_n \tag{B.111} \\
&= \lambda_k, \quad \text{by (B.108)}. \tag{B.112}
\end{align}

This establishes (B.103) and (B.105). To prove (B.104) and (B.106) we just need to show that $u_k$ achieves the maximum

\begin{align}
u_k^T A u_k &= \sum_{i=1}^{n} \lambda_i \left( u_i^T u_k \right)^2 \tag{B.113} \\
&= \lambda_k. \tag{B.114}
\end{align}

□
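Theorem B.7.2 can be probed numerically by evaluating the quadratic form on random unit vectors. The sketch below uses a $3 \times 3$ symmetric matrix constructed by hand so that its eigenvalues are 3, 1 and $-2$; the matrix, the eigenvectors, the sample size and the seed are all assumptions for illustration, and random sampling only gives a sanity check of the bounds.

```python
import math
import random


def quad_form(A, u):
    """q(u) = u^T A u."""
    n = len(u)
    return sum(u[i] * A[i][j] * u[j] for i in range(n) for j in range(n))


def normalize(u):
    scale = math.sqrt(sum(ui * ui for ui in u))
    return [ui / scale for ui in u]


A = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, -2.0]]
s = 1.0 / math.sqrt(2.0)
u1 = [s, s, 0.0]   # normalized eigenvector with eigenvalue 3
u2 = [s, -s, 0.0]  # normalized eigenvector with eigenvalue 1

rng = random.Random(0)
max_all = -math.inf   # largest value over sampled unit vectors
max_orth = -math.inf  # largest value over sampled unit vectors orthogonal to u1
for _ in range(5000):
    g = [rng.gauss(0.0, 1.0) for _ in range(3)]
    max_all = max(max_all, quad_form(A, normalize(g)))
    c = sum(gi * ui for gi, ui in zip(g, u1))
    g_orth = [gi - c * ui for gi, ui in zip(g, u1)]  # remove the component along u1
    max_orth = max(max_orth, quad_form(A, normalize(g_orth)))

print(quad_form(A, u1), max_all)   # sampled values never exceed lambda_1 = 3
print(quad_form(A, u2), max_orth)  # restricted values never exceed lambda_2 = 1
```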


B.8 Proofs 

B.8.1 Proof of Theorem B.1.6 

We prove the claim by contradiction. Assume that we have two bases $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ such that $m < n$ (or the second set has infinite cardinality). The proof follows from applying the following lemma $m$ times (setting $r = 0, 1, \ldots, m - 1$) to show that $\{y_1, \ldots, y_m\}$ spans $\mathcal{V}$ and hence $\{y_1, \ldots, y_n\}$ must be linearly dependent.

Lemma B.8.1. Under the assumptions of the theorem, if $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$ spans $\mathcal{V}$ then $\{y_1, \ldots, y_{r+1}, x_{r+2}, \ldots, x_m\}$ also spans $\mathcal{V}$ (possibly after rearranging the indices $r+1, \ldots, m$) for $r = 0, 1, \ldots, m - 1$.

Proof. Since $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$ spans $\mathcal{V}$

\[
y_{r+1} = \sum_{i=1}^{r} \beta_i y_i + \sum_{i=r+1}^{m} \gamma_i x_i, \qquad \beta_1, \ldots, \beta_r, \gamma_{r+1}, \ldots, \gamma_m \in \mathbb{R}, \tag{B.115}
\]

where at least one of the $\gamma_j$ is nonzero, as $\{y_1, \ldots, y_n\}$ is linearly independent by assumption. Without loss of generality (here is where we might need to rearrange the indices) we assume that $\gamma_{r+1} \neq 0$, so that

\[
x_{r+1} = \frac{1}{\gamma_{r+1}} \Big( y_{r+1} - \sum_{i=1}^{r} \beta_i y_i - \sum_{i=r+2}^{m} \gamma_i x_i \Big). \tag{B.116}
\]

This implies that any vector in the span of $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$, i.e. in $\mathcal{V}$, can be represented as a linear combination of vectors in $\{y_1, \ldots, y_{r+1}, x_{r+2}, \ldots, x_m\}$, which completes the proof. □


B.8.2 Proof of Theorem B.2.4 


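If $\|x\|_{\langle\cdot,\cdot\rangle} = 0$ then $x = 0$ because the inner product is positive semidefinite, which implies $\langle x, y \rangle = 0$ and consequently that (B.20) holds with equality. The same is true if $\|y\|_{\langle\cdot,\cdot\rangle} = 0$.

Now assume that $\|x\|_{\langle\cdot,\cdot\rangle} \neq 0$ and $\|y\|_{\langle\cdot,\cdot\rangle} \neq 0$. By semidefiniteness of the inner product,

\begin{align}
0 &\leq \big\| \|y\|_{\langle\cdot,\cdot\rangle} \, x + \|x\|_{\langle\cdot,\cdot\rangle} \, y \big\|_{\langle\cdot,\cdot\rangle}^2 = 2 \|x\|_{\langle\cdot,\cdot\rangle}^2 \|y\|_{\langle\cdot,\cdot\rangle}^2 + 2 \|x\|_{\langle\cdot,\cdot\rangle} \|y\|_{\langle\cdot,\cdot\rangle} \langle x, y \rangle, \tag{B.117} \\
0 &\leq \big\| \|y\|_{\langle\cdot,\cdot\rangle} \, x - \|x\|_{\langle\cdot,\cdot\rangle} \, y \big\|_{\langle\cdot,\cdot\rangle}^2 = 2 \|x\|_{\langle\cdot,\cdot\rangle}^2 \|y\|_{\langle\cdot,\cdot\rangle}^2 - 2 \|x\|_{\langle\cdot,\cdot\rangle} \|y\|_{\langle\cdot,\cdot\rangle} \langle x, y \rangle. \tag{B.118}
\end{align}

These inequalities establish (B.20).

Let us prove (B.21) by proving both implications.

($\Rightarrow$) Assume $\langle x, y \rangle = - \|x\|_{\langle\cdot,\cdot\rangle} \|y\|_{\langle\cdot,\cdot\rangle}$. Then (B.117) equals zero, so $\|y\|_{\langle\cdot,\cdot\rangle} \, x = - \|x\|_{\langle\cdot,\cdot\rangle} \, y$ because the inner product is positive semidefinite.

($\Leftarrow$) Assume $\|y\|_{\langle\cdot,\cdot\rangle} \, x = - \|x\|_{\langle\cdot,\cdot\rangle} \, y$. Then one can easily check that (B.117) equals zero, which implies $\langle x, y \rangle = - \|x\|_{\langle\cdot,\cdot\rangle} \|y\|_{\langle\cdot,\cdot\rangle}$.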

The proof of (B.22) is identical (using (B.118) instead of (B.117)). 











